Preface


The purpose of this book is to provide a sound introduction to the theory, 
methods, and application of numerical analysis/computational mathematics. 
It is written with two primary objectives: 


(a) to provide an advanced graduate introduction to the theory and the meth- 
ods in numerical analysis that will help prepare students for taking doc- 
toral examinations in numerical analysis; and 


(b) to provide a solid foundation in numerical analysis for more specialized 
topics such as finite-element theory and application, advanced numerical 
linear algebra, optimization, or approximation of stochastic differential 
equations. 


Indeed, this book provides useful background knowledge for graduate study 
in any area of applied mathematics. 

The main topics in introductory numerical analysis include solution of non- 
linear equations, numerical linear algebra, ordinary differential equations, 
approximation theory, as well as, for example, numerical integration and 
boundary-value problems. These topics are introduced and examined in sep- 
arate chapters. Many examples are described to illustrate the concepts. The 
emphasis in the explanations is to provide a good understanding of the con- 
cepts. At the end of each chapter, analytical and computational exercises are 
provided. The exercises are designed to complement the material in the text 
and to illustrate how the theory and methods can be applied. An interesting 
feature of this book is the presentation of interval computation in numerical 
analysis. Throughout the text, explanations of interval arithmetic, interval 
computation, and interval algorithms are provided. 

There are many excellent books available on the theory, application, and 
computational methods of numerical analysis. Most of these texts, however, 
are either undergraduate texts with elementary theory and problems or spe- 
cialized treatments, for example, on numerical linear algebra or differential 
equations. The bibliography lists many of these books. It is hoped that the 
present book will complement these previous books in providing a more ad- 
vanced graduate-level introduction to the theory and methods in numerical 
analysis. Nevertheless, as numerical analysis is a large and rapidly expanding 
area of mathematics, we were forced to choose topics, computational methods, 
and analytical techniques that we thought would endure as well as provide a 
good background for understanding numerical techniques for more advanced 
problems. 
One of the objectives of this book is to provide a clear and solid introduction
to the theory and application of computational methods for applied mathe- 
maticians. The intent of this book is to provide a background to numerical 
methods so the reader will be in a position to apply the techniques and to un- 
derstand the mathematical literature in this area, including more specialized 
texts. To understand the material presented in this book, proficiency in lin- 
ear algebra and intermediate analysis is assumed. In particular, prerequisite 
courses for thoroughly understanding the concepts in this book include linear 
algebra, differential equations, advanced calculus, and intermediate analysis. 
In addition, some knowledge of scientific computing and programming is very 
helpful. Throughout the book, computational procedures are described, and 
to thoroughly understand these procedures, familiarity with some computer 
language such as MATLAB or Fortran is essential. 

For those readers with access to MATLAB, we have provided a set of MATLAB 
functions and scripts implementing various computations described in the 
text. These are available, either individually or as a complete “zip” file, on 
the web page 


http://interval.louisiana.edu/Classical-and-Modern-NA/ 


This web page gives short descriptions, as well as section numbers and exercise 
numbers corresponding to each function provided. It also cross-references sec- 
tion numbers to selected intrinsic capabilities of MATLAB, and does the same 
for interval arithmetic routines, publicly available for the book Introduction 
to Interval Analysis, by Ramon E. Moore, R. Baker Kearfott, and Michael J. 
Cloud, SIAM, Philadelphia, 2009. 

We are grateful to the University of Louisiana at Lafayette, Texas Tech 
University, and to George Mason University for providing us with the op- 
portunities to use this book in teaching both two-semester and one-semester 
(with selected topics) graduate courses in numerical analysis. We wish to 
thank our colleagues Christo Christov and Hongtao Yang for using this book 
in teaching the graduate numerical analysis course at UL Lafayette and for 
providing us with corrections. We are also grateful to the colleagues and 
graduate students who provided us with comments and corrections on the 
manuscript. In particular, we wish to thank Youssef Dib for his diligent work 
completing the figures, and Shuhua Hu and Xubo Wang for their help in
typing some of these notes in LaTeX. We also wish to thank Anthony Holmes
and Mark Thompson, who went out of their way to supply many valuable 
corrections and suggestions. The students in our graduate-level numerical 
analysis course who, while learning the material from preliminary copies of 
the manuscript, also made numerous corrections and suggestions, are worthy 
of mention, specifically, Ashley Avery, Ross Chiquet, Mark Delcambre, Frank 
Hammers, Xiaodong Lian, Tchavdar Marinov, Julie Roy, Christine Terranova, 
and Xing Yang. 
Chapter 1


Mathematical Review and Computer 
Arithmetic 


1.1 Mathematical Review 


In this section, several mathematical results and definitions are reviewed 
that will be useful throughout many of the following chapters. 


1.1.1 Intermediate Value Theorem, Mean Value Theorems, 
and Taylor’s Theorem 


Throughout, $C^n[a,b]$ will denote the set of real-valued functions $f$ defined
on the interval $[a,b]$ such that $f$ and its derivatives, up to and including its
$n$-th derivative $f^{(n)}$, are continuous on $[a,b]$.


THEOREM 1.1 


(Intermediate value theorem) If $f \in C[a,b]$ and $k$ is any number between
$m = \min_{a \le x \le b} f(x)$ and $M = \max_{a \le x \le b} f(x)$, then there exists
a number $c$ in $[a,b]$ for which $f(c) = k$ (Figure 1.1).


FIGURE 1.1: Illustration of the Intermediate Value Theorem. 
THEOREM 1.2

(Mean value theorem for integrals) Let $f$ be continuous and $w$ be Riemann
integrable¹ on $[a,b]$ and suppose that $w(x) \ge 0$ for $x \in [a,b]$. Then there exists
a point $\xi$ in $[a,b]$ such that
\[ \int_a^b w(x) f(x)\,dx = f(\xi) \int_a^b w(x)\,dx. \]


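PROOF Let $A = \min_{a \le x \le b} f(x)$ and $B = \max_{a \le x \le b} f(x)$. Then
$A\,w(x) \le f(x)\,w(x) \le B\,w(x)$. Hence,
\[ A \int_a^b w(x)\,dx \le \int_a^b w(x) f(x)\,dx \le B \int_a^b w(x)\,dx. \]
Thus,
\[ A \le \frac{\int_a^b w(x) f(x)\,dx}{\int_a^b w(x)\,dx} \le B. \]
By the Intermediate Value Theorem, there is a $\xi$ in $[a,b]$ such that
\[ f(\xi) = \frac{\int_a^b w(x) f(x)\,dx}{\int_a^b w(x)\,dx}. \]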


THEOREM 1.3 
(Taylor’s theorem) Suppose that $f \in C^{n+1}[a,b]$. Let $x_0 \in [a,b]$. Then for
any $x \in [a,b]$, $f(x) = P_n(x) + R_n(x)$, where
\[ P_n(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(n)}(x_0)(x - x_0)^n}{n!}
          = \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(x_0)(x - x_0)^k, \]
and
\[ R_n(x) = \frac{1}{n!} \int_{x_0}^{x} f^{(n+1)}(t)(x - t)^n\,dt \quad \text{(integral form of remainder).} \]
Furthermore, there is a $\xi = \xi(x)$ between $x_0$ and $x$ with
\[ R_n(x) = \frac{f^{(n+1)}(\xi(x))(x - x_0)^{n+1}}{(n+1)!} \quad \text{(Lagrange form of remainder).} \]


¹ In most, but not all, contexts in numerical analysis, $w$ will be continuous.
PROOF Recall the integration by parts formula $\int u\,dv = uv - \int v\,du$.
Thus,
\[ f(x) - f(x_0) = \int_{x_0}^{x} f'(t)\,dt \quad (\text{let } u = f'(t),\ v = t - x,\ dv = dt) \]
\[ = f'(x_0)(x - x_0) + \int_{x_0}^{x} (x - t) f''(t)\,dt \quad (\text{let } u = f''(t),\ dv = (x - t)\,dt) \]
\[ = f'(x_0)(x - x_0) - \left. \frac{(x - t)^2}{2} f''(t) \right|_{x_0}^{x} + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt. \]
Continuing this procedure,
\[ f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \cdots
        + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) + \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt
        = P_n(x) + R_n(x). \]
Now consider $R_n(x) = \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt$ and assume that $x_0 < x$
(same argument if $x_0 > x$). Then, by Theorem 1.2,
\[ R_n(x) = f^{(n+1)}(\xi(x)) \int_{x_0}^{x} \frac{(x - t)^n}{n!}\,dt
          = f^{(n+1)}(\xi(x)) \frac{(x - x_0)^{n+1}}{(n+1)!}, \]
where $\xi$ is between $x_0$ and $x$ and thus, $\xi = \xi(x)$.


An important special case of Taylor’s theorem is obtained with n = 0 (that 
is, directly from the Fundamental Theorem of Calculus). 


THEOREM 1.4 
(Mean value theorem) Suppose $f \in C^1[a,b]$, $x \in [a,b]$, and $y \in [a,b]$ (and,
without loss of generality, $x \le y$). Then there is a $\xi \in [x,y] \subseteq [a,b]$ such that
\[ f(y) - f(x) = f'(\xi)(y - x). \]


Example 1.1
Show that $e^x > 1 + x + \frac{x^2}{2}$ for all $x > 0$.

PROOF By Taylor’s Theorem,
\[ e^x = f(0) + f'(0)(x - 0) + f''(0)\frac{(x - 0)^2}{2!} + \int_0^x \frac{(x - t)^2}{2!} f'''(t)\,dt, \]
where $f(x) = e^x$. Thus,
\[ e^x = 1 + x + \frac{x^2}{2} + \int_0^x \frac{1}{2}(x - t)^2 e^t\,dt, \]
and it follows that $e^x > 1 + x + \frac{x^2}{2}$ for $x > 0$, since
\[ \int_0^x \frac{1}{2}(x - t)^2 e^t\,dt > 0 \quad \text{for } x > 0. \]


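Example 1.2

Show that $\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| \le ch$ for $x, x+h \in [a,b]$, assuming that
$f \in C^2[a,b]$.

PROOF
\[ \left| \frac{f(x+h) - f(x)}{h} - f'(x) \right|
   = \left| \frac{f(x) + f'(x)h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| \]
\[ = \frac{1}{h} \left| \int_x^{x+h} (x + h - t) f''(t)\,dt \right|
   \le \max_{a \le t \le b} |f''(t)| \, \frac{h}{2} = ch. \]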


THEOREM 1.5 


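(Taylor’s Theorem in Two Variables) Suppose that $f(x,y)$ and all its partial
derivatives of order less than or equal to $n+1$ are continuous in the rectangle
$D = \{(x,y) \mid a \le x \le b,\ c \le y \le d\}$. Let $(x_0, y_0) \in D$. For every $(x,y) \in D$,
there exists a $\xi$ between $x$ and $x_0$ and an $\eta$ between $y$ and $y_0$ such that
$f(x,y) = P_n(x,y) + R_n(x,y)$, where
\[ P_n(x,y) = f(x_0, y_0) + \left[ (x - x_0) \frac{\partial f}{\partial x}(x_0, y_0) + (y - y_0) \frac{\partial f}{\partial y}(x_0, y_0) \right] + \cdots
   + \left[ \frac{1}{n!} \sum_{j=0}^{n} \binom{n}{j} (x - x_0)^{n-j} (y - y_0)^j \frac{\partial^n f}{\partial x^{n-j} \partial y^j}(x_0, y_0) \right] \]
and
\[ R_n(x,y) = \frac{1}{(n+1)!} \sum_{j=0}^{n+1} \binom{n+1}{j} (x - x_0)^{n+1-j} (y - y_0)^j \frac{\partial^{n+1} f}{\partial x^{n+1-j} \partial y^j}(\xi, \eta). \]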


1.1.2 Big “O” and Little “o” Notation 


We study “rates of growth” and “rates of decrease” of errors. For example,
if we approximate $e^h$ by a first degree Taylor polynomial about $x = 0$, we get
\[ e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi}, \]
where $\xi$ is some unknown quantity between 0 and $h$. Although we don’t
know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case,
approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional
to $h^2$ for $h$ small. This approximate proportionality is often more important to
know than the slowly-varying constant $e^{\xi}$. The big “O” and little “o” notation
are used to describe and keep track of this approximate proportionality.
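
The proportionality described above can be checked numerically. The following is a minimal Python sketch, not part of the text's MATLAB material; the name `taylor_error` is an illustrative assumption. It evaluates the error of the first-degree Taylor polynomial for several small h and divides by h squared, which should settle near 1/2.

```python
import math

def taylor_error(h):
    """Error of the first-degree Taylor polynomial 1 + h for e^h."""
    # expm1(h) = e^h - 1 is evaluated accurately for small h,
    # which avoids cancellation in e^h - (1 + h).
    return math.expm1(h) - h

# The ratio error / h^2 should approach e^0 / 2 = 0.5 as h -> 0.
ratios = [taylor_error(10.0 ** -k) / (10.0 ** -k) ** 2 for k in range(1, 6)]
for k, r in enumerate(ratios, start=1):
    print(f"h = 1e-{k}:  error / h^2 = {r:.6f}")
```

The nearly constant ratio is what the statement "the error is O(h^2)" records, without tracking the slowly varying factor.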


DEFINITION 1.1 (Big O and little o notation) Let $\{x_k\}$ and $\{\alpha_k\}$ be
two sequences. We write $x_k = O(\alpha_k)$ as $k \to \infty$ if there are constants $c$ and $r$
such that $|x_k| \le c|\alpha_k|$ when $k \ge r$. We write $x_k = o(\alpha_k)$ if, given any $\epsilon > 0$,
there is an $r$ such that $|x_k| < \epsilon|\alpha_k|$ for $k \ge r$. Similarly, if $f(h)$ and $g(h)$ are
functions of a continuous variable $h$, we say that $f(h) = O(g(h))$ as $h \to 0$
provided there are constants $c$ and $\delta$ such that $|f(h)| \le c|g(h)|$ for $|h| \le \delta$.


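Note that, provided the limits exist,
\[ \text{if } x_k = O(\alpha_k), \text{ then } \lim_{k \to \infty} \left| \frac{x_k}{\alpha_k} \right| \le c. \]
Similarly,
\[ \text{if } x_k = o(\alpha_k), \text{ then } \lim_{k \to \infty} \left| \frac{x_k}{\alpha_k} \right| = 0. \]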


The “O” denotes “order.” For example, if $f(h) = O(h^2)$, we say that “$f$
exhibits order 2 convergence to 0 as $h$ tends to 0.”
REMARK 1.1 The “big O” notation is often used in the context of error
analysis for vector-valued functions. Let $e(h)$ be a vector that represents an
error in some other vector; we then might say
\[ e(h) = O(h^k) \quad \text{as } h \to 0, \]
provided there exists a constant $c$ independent of $h$, such that $\|e(h)\| \le c h^k$
for all $h \in [0, h_0]$. (Roughly, for sufficiently small $h$, $\|e(h)\|$ does not go to
zero slower than $h^k$.)


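Example 1.3
Let $x_k = e^{1/k} - 1/k - 1$ and $\alpha_k = \frac{1}{k^2}$. Then $x_k = O(\alpha_k)$.


PROOF By Taylor’s Theorem,
\[ e^{1/k} = e^0 + e^0 (1/k - 0) + \int_0^{1/k} e^t (1/k - t)\,dt. \]
Thus,
\[ \left| e^{1/k} - 1 - \frac{1}{k} \right| \le e^{1/k} \int_0^{1/k} (1/k - t)\,dt = \frac{e^{1/k}}{2k^2} \le \frac{e}{2} \cdot \frac{1}{k^2} \quad \text{for } k \ge 1. \]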
The following definition extends this notation to functions. 


DEFINITION 1.2 Suppose that $\lim_{h \to 0} f(h) = L$. We write $f(h) - L = O(g(h))$
if there is a constant $K$ such that $\left| \frac{f(h) - L}{g(h)} \right| \le K$ for sufficiently
small $h > 0$.
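Example 1.4
Show that $\cos h - 1 + \frac{h^2}{2} = O(h^4)$.

PROOF By Taylor’s Theorem,
\[ \cos h = 1 - \frac{h^2}{2} + \frac{1}{3!} \int_0^h \cos(t)(h - t)^3\,dt. \]
Thus,
\[ \left| \cos h - 1 + \frac{h^2}{2} \right| \le \frac{1}{6} \int_0^h (h - t)^3\,dt = \frac{h^4}{24}. \]
Hence,
\[ \frac{\left| \cos h - 1 + \frac{h^2}{2} \right|}{|h^4|} \le \frac{1}{24}. \]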
1.1.3 Convergence Rates


DEFINITION 1.3 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are
constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C|x_k - x^*|^{\alpha}$ for
$k \ge N$ we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$
(with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be
quadratic.


Example 1.5
Show that the sequence $\{x_k\}$ defined by $x_{k+1} = \frac{x_k^2 + 4}{2 x_k}$, $k = 0, 1, 2, \ldots$, with
$x_0 = 3$, converges quadratically to $x^* = 2$.

PROOF
(Part 1: showing convergence) It is first shown that $\lim_{k \to \infty} x_k = 2$. If $x_k > 2$,
then $x_k^2 - 4 > 0$. Thus, $2x_k^2 > x_k^2 + 4$. Hence, it follows that
\[ x_k > \frac{x_k^2 + 4}{2 x_k} = x_{k+1}. \qquad (1.1) \]
Also, $-(x_k - 2)^2 < 0$ gives $4x_k - x_k^2 - 4 < 0$. Hence, $4x_k < x_k^2 + 4$
and thus,
\[ 2 < \frac{x_k^2 + 4}{2 x_k} = x_{k+1}. \qquad (1.2) \]
By (1.1) and (1.2), $x_k > x_{k+1} > 2$. Hence,
\[ x_0 > x_1 > x_2 > x_3 > \cdots > x_k > \cdots > 2. \]
Therefore, $\{x_k\}$ is a monotonically decreasing sequence bounded below by 2
and thus has a limit $x^*$. But $x^* = \frac{(x^*)^2 + 4}{2x^*}$ gives $x^* = 2$.

(Part 2: showing that the convergence is quadratic) We see that
\[ |x_{k+1} - 2| = \left| \frac{x_k^2 + 4}{2 x_k} - 2 \right| = \left| \frac{x_k^2 + 4 - 4x_k}{2 x_k} \right|
   = \left| \frac{(x_k - 2)^2}{2 x_k} \right| < \frac{1}{4} |x_k - 2|^2, \]
since $x_k > 2$. Therefore, $\{x_k\}$ converges quadratically to 2.


Quadratic convergence is very fast. For the above example, if $|x_k - 2| =
0.01$ then $|x_{k+1} - 2| < \frac{1}{4}(0.01)^2$, $|x_{k+2} - 2| < \frac{1}{64}(0.01)^4$, .... We give some
computational results for this example with $x_0 = 3$:

k | x_k
0 | 3
1 | 2.16667
2 | 2.00641
3 | 2.00001
4 | 2.00000

Sometimes, it is not practical to devise a computation to exhibit quadratic 
convergence. In such cases, however, a convergence rate that is almost as fast 
can sometimes be achieved: 


DEFINITION 1.4 A sequence $\{x_k\}$ with limit $x^*$ is said to exhibit
superlinear convergence, provided
\[ \frac{|x_{k+1} - x^*|}{|x_k - x^*|} \to 0 \quad \text{as } k \to \infty. \]


Superlinear convergence is faster than linear, but can, in principle, be slower
than convergence of order $\alpha$, for any $\alpha > 1$.


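Example 1.6
The sequence $\{x_k\}$ defined by
\[ x_k = \frac{1}{2^{k^2}} \]
is superlinearly convergent to 0, since
\[ \frac{|x_{k+1} - 0|}{|x_k - 0|} = \frac{1/2^{(k+1)^2}}{1/2^{k^2}} = \frac{1}{2^{2k+1}} \to 0 \quad \text{as } k \to \infty. \]
However, $\{x_k\}$ is not quadratically convergent, since
\[ \frac{|x_{k+1} - 0|}{|x_k - 0|^2} = \frac{1/2^{(k+1)^2}}{\left(1/2^{k^2}\right)^2} = 2^{k^2 - 2k - 1} \to \infty \quad \text{as } k \to \infty. \]
In fact, it can be shown that $\{x_k\}$ is not convergent with convergence order
$\alpha$ for any $\alpha > 1$.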


1.2 Computer Arithmetic 


In numerical solution of mathematical problems, two common types of error 
are: 
1. Method (algorithm or truncation) error. This is the error due to
approximations made in the numerical method.


2. Rounding error. This is the error made due to the finite number of digits 
available on a computer. 


Example 1.7
By the mean value theorem for integrals (Theorem 1.2, as in Example 1.2),
if $f \in C^2[a,b]$, then
\[ f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \]
and
\[ \left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le ch. \]
Thus, $f'(x) \approx (f(x+h) - f(x))/h$, and the error is $O(h)$. We will call this
the method error or truncation error, as opposed to roundoff errors due to
using machine approximations.


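Now consider $f(x) = \ln x$ and approximate $f'(3) \approx \frac{\ln(3+h) - \ln 3}{h}$ for $h$ small
using a calculator having 11 digits. The following results were obtained.

h        | (ln(3+h) - ln(3))/h | Error = 1/3 - (ln(3+h) - ln(3))/h = O(h)
10^{-1}  | 0.3278982           | 5.44 × 10^{-3}
10^{-2}  | 0.332779            | 5.54 × 10^{-4}
10^{-3}  | 0.3332778           | 5.55 × 10^{-5}
10^{-4}  | 0.333328            | 5.33 × 10^{-6}
10^{-5}  | 0.333330            | 3.33 × 10^{-6}
10^{-6}  | 0.333300            | 3.33 × 10^{-5}
10^{-7}  |                     | 3.33 × 10^{-4}
10^{-8}  |                     | 3.33 × 10^{-3}
10^{-9}  |                     | 3.33 × 10^{-2}
10^{-10} |                     | 3.33 × 10^{-1}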


One sees that, in the first four steps, the error decreases by a factor of 10
as h is decreased by a factor of 10. (That is, the method error dominates.)
However, starting with h = 0.00001, the error increases. (The error due to a
finite number of digits, i.e., roundoff error, dominates.)
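
The same experiment can be repeated in IEEE double precision. The following Python sketch is an illustration under that assumption (about 16 decimal digits instead of the calculator's 11), and the name `forward_difference` is hypothetical. The qualitative behavior is the same, but the turning point moves to a smaller h because more digits are carried.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

exact = 1.0 / 3.0  # derivative of ln(x) at x = 3
errors = {}
for k in range(1, 16):
    h = 10.0 ** -k
    errors[k] = abs(exact - forward_difference(math.log, 3.0, h))
    print(f"h = 1e-{k:02d}   error = {errors[k]:.2e}")
```

For large h the printed error shrinks in proportion to h (method error); for very small h it grows again as roundoff in the numerator is amplified by the division by h.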


REMARK 1.2 Problems with round-off error have contributed to several
major disasters in the real world. One such example is the Patriot Missile
failure, in Dhahran, Saudi Arabia, on February 25, 1991. This incident resulted
in 28 deaths, and was ultimately attributable to poor handling of roundoff
error.


REMARK 1.3 There are two possible ways to reduce rounding error: 


1. The method error can be reduced by using a more accurate method. 
This allows larger h to be used, thus avoiding roundoff error. Consider 


\[ f'(x) = \frac{f(x+h) - f(x-h)}{2h} + \{\text{error}\}, \quad \text{where \{error\} is } O(h^2). \]

h     | (ln(3+h) - ln(3-h))/(2h) | error
0.1   | 0.3334568                | 1.24 × 10^{-4}
0.01  | 0.3333345                | 1.23 × 10^{-6}
0.001 | 0.3333333                | 1.91 × 10^{-8}

The error decreases by a factor of 100 as h is decreased by a factor of 10.


2. Rounding error can be reduced by using more digits of accuracy, such 
as using double precision (or multiple precision) arithmetic. 
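
Both remedies can be seen together in a short experiment. The following Python sketch is illustrative only; `central_difference` is a hypothetical name, and IEEE double precision is assumed, so it uses more digits than the 11-digit calculator. It applies the central difference formula to ln x at x = 3.

```python
import math

def central_difference(f, x, h):
    """Approximate f'(x) by (f(x + h) - f(x - h)) / (2 h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

exact = 1.0 / 3.0
central_errors = [abs(exact - central_difference(math.log, 3.0, 10.0 ** -k))
                  for k in (1, 2, 3)]
for k, err in zip((1, 2, 3), central_errors):
    print(f"h = 1e-{k}   error = {err:.2e}")
```

With the extra digits, the factor-of-100 reduction persists at h = 0.001, where the 11-digit calculation had already begun to lose it.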




To fully understand and avoid roundoff error, we should study some de- 
tails of how computers and calculators represent and work with approximate 
numbers. 


1.2.1 Floating Point Arithmetic and Rounding Error 


Let $\beta$ = {a positive integer}, the base of the computer system. (Usually,
$\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact
base representation
\[ x = (\pm 0.\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_t \alpha_{t+1} \cdots)\beta^m = \pm q \beta^m, \]
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le \alpha_1 \le \beta - 1$ and
$0 \le \alpha_i \le \beta - 1$ for $i > 1$.

On a computer, we are restricted to a finite set of floating-point numbers
$F = F(\beta, t, L, U)$ of the form $x^* = (\pm 0.a_1 a_2 \cdots a_t)\beta^m$, where $1 \le a_1 \le \beta - 1$,
$0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In
most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to
1000.)
Example 1.8
(binary) $\beta = 2$
\[ x^* = (0.1011)2^3 = \left( 1 \times \frac{1}{2} + 0 \times \frac{1}{4} + 1 \times \frac{1}{8} + 1 \times \frac{1}{16} \right) \times 8
       = \frac{11}{2} = 5.5 \ \text{(decimal)}. \]


REMARK 1.4 Most numbers cannot be exactly represented on a computer.
Consider $x = 10.1 = 1010.0001\,1001\,1001\cdots$ ($\beta = 2$). If $L = -127$, $U = 127$,
$t = 24$, and $\beta = 2$, then $x \approx x^* = (0.10100001\,1001\,1001\,1001\,1001)2^4$.
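
The remark can be checked directly. The following Python sketch is an illustration that assumes IEEE double precision (a 53-bit mantissa rather than the t = 24 of the remark); the variable names are hypothetical. It uses exact rational arithmetic to show that the stored value of 10.1 differs from 10.1, and regenerates the repeating binary digits by repeated doubling.

```python
from fractions import Fraction

x_exact = Fraction(101, 10)      # the real number 10.1
x_stored = Fraction(10.1)        # exact value of the double nearest to 10.1

print("stored value  :", x_stored)
print("relative error:", float(abs(x_stored - x_exact) / x_exact))

# Binary digits of 10.1: integer part 1010, then the fraction 0.1 expanded
# by repeated doubling, which produces the repeating pattern 0001 1001 1001 ...
frac, bits = Fraction(1, 10), []
for _ in range(16):
    frac *= 2
    bits.append(int(frac >= 1))
    frac -= bits[-1]
print("10.1 = 1010." + "".join(map(str, bits)) + "... (base 2)")
```

Because the binary expansion never terminates, any finite t leaves a representation error, however many bits are kept.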


Question: Given a real number $x$, how do we define a floating point number
$fl(x)$ in $F$, such that $fl(x)$ is close to $x$?
Here are two possible procedures² for choosing $fl(x)$:


1. Chop: $fl(x)$ is that element of $F$ such that $fl(x) = \mathrm{sgn}(x)(0.a_1 a_2 \cdots a_t)\beta^m$,
   where $a_i = \alpha_i$, $1 \le i \le t$, where $|x|$ has infinite base $\beta$ expansion
   \[ |x| = (0.\alpha_1 \alpha_2 \cdots \alpha_t \alpha_{t+1} \cdots) \times \beta^m, \qquad (1.3) \]
   and where
   \[ \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ -1 & \text{if } x < 0. \end{cases} \qquad (1.4) \]


2. Round: $fl(x)$ is that element of $F$ closest to $x$, and if $x$ is exactly between
   two elements of $F$, then $fl(x)$ is generally taken to be the element of
   largest magnitude. Thus, for round, if $|x|$ has base $\beta$ expansion as in
   (1.3), let
   \[ |x| + \frac{1}{2}\beta^{-t}\beta^m = (0.a_1^* a_2^* \cdots a_t^* a_{t+1}^* \cdots)\beta^m. \qquad (1.5) \]
   Then, $fl(x) = \mathrm{sgn}(x)(0.a_1^* a_2^* \cdots a_t^*)\beta^m$.
² On most modern machines, a choice of four rounding modes is actually provided.
See Section 1.2.2.


Example 1.9
$\beta = 10$, $t = 5$, $x = 0.12345666\cdots \times 10^7$. Then

fl(x) = 0.12345 × 10^7 (chopped).
fl(x) = 0.12346 × 10^7 (rounded).

(Note: $x + \frac{1}{2}\beta^{-t}\beta^m = 0.12345666\cdots \times 10^7 + \frac{1}{2} 10^{-5} 10^7 = 0.12346166\cdots \times 10^7$.)
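
Chopping and rounding to t decimal digits can be imitated with Python's standard `decimal` module. This is a sketch under the assumption that `ROUND_DOWN` plays the role of chop and `ROUND_HALF_UP` the role of round (nearest, ties to larger magnitude); the helper name `fl` is chosen here for illustration.

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_UP

def fl(x, t, rounding):
    """Return x kept to t significant decimal digits with the given rule."""
    return Context(prec=t, rounding=rounding).plus(Decimal(x))

x = "0.12345666E+7"
chopped = fl(x, 5, ROUND_DOWN)      # chop: discard digits after the t-th
rounded = fl(x, 5, ROUND_HALF_UP)   # round: nearest, ties away from zero
print(chopped, rounded)
```

The two results differ by one unit in the fifth digit, matching the chopped and rounded values of the example.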


See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the
exhibited floating point numbers are $(0.1) \times 10^1$, $(0.2) \times 10^1$, ..., $(0.9) \times 10^1$,
$0.1 \times 10^2$.


FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and
$m = 1$. Successive floating point numbers between $\beta^{m-1} = 1$ and $\beta^m = 10^1$
are separated by $\beta^{m-t} = 10^0 = 1$.


We now have the following error bound. 


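THEOREM 1.6
\[ |x - fl(x)| \le \frac{1}{2} |x| \beta^{1-t} p, \]
where $p = 1$ for rounding and $p = 2$ for chopping.


PROOF Since $x = (\pm 0.\alpha_1 \alpha_2 \cdots \alpha_t \cdots)\beta^m$, we have $\beta^{m-1} \le |x| \le \beta^m$.
In the interval $[\beta^{m-1}, \beta^m]$, the floating point numbers are evenly spaced with
separation $\beta^{m-t}$. Thus, for chopping,
\[ |x - fl(x)| \le \beta^{m-t} = \frac{p}{2} \beta^{m-t}, \]
and for rounding,
\[ |x - fl(x)| \le \frac{1}{2} \beta^{m-t} = \frac{p}{2} \beta^{m-t}. \]
Hence,
\[ |x - fl(x)| \le \frac{p}{2} \beta^{m-t} = \frac{p}{2} \beta^{1-t} \beta^{m-1} \le \frac{1}{2} |x| \beta^{1-t} p. \]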


REMARK 1.5 $\delta = \frac{p}{2} \beta^{1-t}$ is called the unit roundoff error.


REMARK 1.6 Let $\epsilon = \frac{fl(x) - x}{x}$. Then $fl(x) = (1 + \epsilon)x$, where $|\epsilon| \le \delta$.


Now consider errors produced in the arithmetic operations $x + y$, $x - y$, $xy$,
and $x/y$.


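THEOREM 1.7
Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine
numbers. Then
\[ fl(x \odot y) = (x \odot y)(1 + \epsilon), \quad \text{where } |\epsilon| \le \delta = \frac{p}{2} \beta^{1-t}. \]


PROOF Theorem 1.6 gives
\[ |x \odot y - fl(x \odot y)| \le |x \odot y| \frac{p}{2} \beta^{1-t}. \]
Thus,
\[ -|x \odot y| \frac{p}{2} \beta^{1-t} \le -x \odot y + fl(x \odot y) \le \frac{p}{2} \beta^{1-t} |x \odot y|. \]
Hence,
\[ (x \odot y)\left( 1 - \frac{|x \odot y|}{x \odot y} \frac{p}{2} \beta^{1-t} \right) \le fl(x \odot y)
   \le (x \odot y)\left( 1 + \frac{|x \odot y|}{x \odot y} \frac{p}{2} \beta^{1-t} \right). \]
It follows that $fl(x \odot y) = (1 + \epsilon)(x \odot y)$ where $|\epsilon| \le \frac{p}{2} \beta^{1-t} = \delta$.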


Example 1.10
$\beta = 10$, $t = 4$, $p = 1$. (Thus, $\delta = \frac{1}{2} 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$,
$y = 0.6399 \times 10^5$. Then

fl(x + y) = 0.1219 × 10^6 = (x + y)(1 + ε_1), |ε_1| ≈ 0.00033 < δ, and
fl(xy) = 0.3708 × 10^10 = (xy)(1 + ε_2), |ε_2| ≈ 0.000059 < δ.

(Note: $x + y = 0.12194 \times 10^6$, $xy = 0.37082205 \times 10^{10}$.)
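
The numbers in this example can be reproduced with the standard `decimal` module, used here as a stand-in for a 4-digit rounding decimal machine. This is only a sketch; the variable names are illustrative, and `ROUND_HALF_UP` is assumed to model the "round" rule with p = 1.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # beta = 10, t = 4, rounding
delta = Decimal("0.0005")                        # unit roundoff (1/2) * 10**(1 - 4)

x, y = Decimal("57950"), Decimal("63990")
fl_sum, fl_prod = ctx.add(x, y), ctx.multiply(x, y)

# Relative errors epsilon with fl(x op y) = (x op y)(1 + epsilon)
eps_sum = (fl_sum - (x + y)) / (x + y)
eps_prod = (fl_prod - x * y) / (x * y)
print(fl_sum, eps_sum)
print(fl_prod, eps_prod)
```

Both relative errors come out negative here (the machine results are slightly too small), and both are below the unit roundoff in magnitude, as Theorem 1.7 requires.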


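Let’s apply our result to a more complicated problem. Let $x_1$ and $x_2$ be
two exact numbers. Consider
\[ fl(x_1 + x_2) = fl[fl(x_1) + fl(x_2)] = fl[x_1(1 + \hat{\epsilon}_1) + x_2(1 + \hat{\epsilon}_2)]
   = [x_1(1 + \hat{\epsilon}_1) + x_2(1 + \hat{\epsilon}_2)](1 + \epsilon_1), \]
where $|\hat{\epsilon}_1| \le \delta$, $|\hat{\epsilon}_2| \le \delta$, and $|\epsilon_1| \le \delta$. Similarly,
\[ fl(x_1 + x_2 + x_3) = \big( (x_1(1 + \hat{\epsilon}_1) + x_2(1 + \hat{\epsilon}_2))(1 + \epsilon_1) + x_3(1 + \hat{\epsilon}_3) \big)(1 + \epsilon_2) \]
\[ = x_1(1 + \hat{\epsilon}_1)(1 + \epsilon_1)(1 + \epsilon_2) + x_2(1 + \hat{\epsilon}_2)(1 + \epsilon_1)(1 + \epsilon_2)
   + x_3(1 + \hat{\epsilon}_3)(1 + \epsilon_2). \]
Continuing this procedure,
\[ fl\left( \sum_{i=1}^{n} x_i \right) = x_1(1 + \hat{\epsilon}_1) \prod_{i=1}^{n-1} (1 + \epsilon_i)
   + x_2(1 + \hat{\epsilon}_2) \prod_{i=1}^{n-1} (1 + \epsilon_i) + x_3(1 + \hat{\epsilon}_3) \prod_{i=2}^{n-1} (1 + \epsilon_i) \]
\[ + x_4(1 + \hat{\epsilon}_4) \prod_{i=3}^{n-1} (1 + \epsilon_i) + \cdots + x_n(1 + \hat{\epsilon}_n)(1 + \epsilon_{n-1}). \]
Considering this expression, it is clear that, to reduce rounding error, a
sequence should be summed from small numbers to large numbers on a
computer.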


Example 1.11
Suppose $\beta = 10$ and $t = 4$ (4 digit arithmetic), suppose $x_1 = 10000$ and
$x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\[ fl(x_1 + x_2) = 10000, \]
\[ fl(x_1 + x_2 + x_3) = 10000, \]
\[ \vdots \]
\[ fl\left( \sum_{i=1}^{1001} x_i \right) = 10000, \]
when we sum forward from $x_1$. But going backwards,
\[ fl(x_{1001} + x_{1000}) = 2, \]
\[ fl(x_{1001} + x_{1000} + x_{999}) = 3, \]
\[ \vdots \]
\[ fl\left( \sum_{i=1001}^{1} x_i \right) = 11000, \]
which is the correct sum.
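
The effect of summation order can be reproduced with the standard `decimal` module set to 4 significant digits. This is a sketch; the helper name `fl_sum` is an assumption, and `ROUND_HALF_UP` is assumed as the rounding rule.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)   # 4-digit decimal arithmetic

def fl_sum(values):
    """Sum left to right, rounding to 4 digits after every addition."""
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)
    return total

xs = [Decimal(10000)] + [Decimal(1)] * 1000     # x_1 = 10000, x_2..x_1001 = 1
forward = fl_sum(xs)               # large term first
backward = fl_sum(reversed(xs))    # small terms first
print(forward, backward)
```

Summing forward, every added 1 is lost to rounding; summing backward, the ones accumulate exactly to 1000 before the large term is added.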


We now have some useful definitions. 


DEFINITION 1.5 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is
called the absolute error, and $\left| \frac{x - x^*}{x} \right|$ is called the relative error.


For example,
\[ \left| \frac{x - fl(x)}{x} \right| \le \delta = \frac{p}{2} \beta^{1-t} \quad \text{(unit roundoff error).} \]


REMARK 1.7 Large relative errors can occur when two nearly equal 
numbers are subtracted on a computer. 


Example 1.12
$x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy).
Then $x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001$. Thus,
\[ \left| \frac{x_2 - x_1 - (fl(x_2) - fl(x_1))}{x_2 - x_1} \right| = \frac{0.000131 - 0.0001}{0.000131} = 0.237, \]
that is, a relative error of 23.7%.

REMARK 1.8 Sometimes, an algorithm can be modified to reduce round- 
ing error, such as when the rounding error is caused by subtraction of nearly 
equal quantities. 


Example 1.13


Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared
with $|4ac|$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped
arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \qquad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4 digit arithmetic chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \qquad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$, $x_2 \approx -99.98$, but $x_1 = -0.010001$ and $x_2 = -99.989999$,
so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.
Let’s change the algorithm. Assume $b > 0$ (can always make $b > 0$). Then
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \left( \frac{-b - \sqrt{b^2 - 4ac}}{-b - \sqrt{b^2 - 4ac}} \right)
       = \frac{4ac}{2a(-b - \sqrt{b^2 - 4ac})} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}, \]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before).} \]
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now, the relative error in $x_1$ is also 0.01%.
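
Both formulas for the small root can be replayed in 4-digit chopped arithmetic using the standard `decimal` module. This is a sketch under stated assumptions: `ROUND_DOWN` models chopping, and the square root is first computed to high precision and then chopped, because `decimal` rounds its own `sqrt` to nearest regardless of the context's rounding mode. The names `naive_x1` and `stable_x1` are illustrative.

```python
from decimal import Decimal, Context, ROUND_DOWN

ctx = Context(prec=4, rounding=ROUND_DOWN)      # 4-digit chopped arithmetic
a, b, c = Decimal(1), Decimal(100), Decimal(1)

# Chop each intermediate result to 4 digits, as in the hand computation.
disc = ctx.subtract(ctx.multiply(b, b), ctx.multiply(4 * a, c))   # 9996
root = ctx.plus(disc.sqrt())                                       # 99.97

naive_x1 = ctx.divide(ctx.add(-b, root), 2 * a)                   # -b + sqrt
stable_x1 = ctx.divide(-2 * c, ctx.add(b, root))                  # -2c / (b + sqrt)
print(naive_x1, stable_x1)
```

The rewritten formula adds two numbers of the same sign in the denominator, so no cancellation occurs and the small root keeps its leading digits.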


Let us now consider error in function evaluation. Consider a single valued
function $f(x)$ and let $x^* = fl(x)$ be the floating point approximation of $x$.
Therefore the machine evaluates $f(x^*) = f(fl(x))$, which is an approximate
value of $f(x)$ at $x = x^*$. Then the perturbation in $f(x)$ for small perturbations
in $x$ can be computed via Taylor’s formula. This is illustrated in the next
theorem.


THEOREM 1.8


The relative error in functional evaluation is
\[ \left| \frac{f(x) - f(x^*)}{f(x)} \right| \approx \left| \frac{x f'(x^*)}{f(x)} \right| \left| \frac{x - x^*}{x} \right|. \]

PROOF The linear Taylor approximation of $f(x)$ about $x^*$ for small
values of $|x - x^*|$ is given by $f(x) \approx f(x^*) + f'(x^*)(x - x^*)$. Rearranging the
terms immediately yields the result.


We now define the condition number of a function $f(x)$ as
\[ \kappa_f(x^*) = \left| \frac{x f'(x^*)}{f(x)} \right|, \]
which describes how large the relative error in function evaluation is with
respect to the relative error in the machine representation of $x$. In other
words, $\kappa_f(x^*)$ is a measure of the degree of sensitivity of the function at
$x = x^*$.
Example 1.14
Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x = x^*$ is
\[ \kappa_f(x^*) = \left| \frac{x^* \cdot \frac{1}{2\sqrt{x^*}}}{\sqrt{x^*}} \right| = \frac{1}{2}. \]
This suggests that $f(x)$ is well-conditioned.


Example 1.15
Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x = x^*$ is
\[ \kappa_f(x^*) = \left| \frac{x^*}{2(x^* - 2)} \right|. \]
This is not defined at $x^* = 2$. Hence the function $f(x)$ is numerically unstable
and ill-conditioned for values of $x$ close to 2.
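
The two condition numbers can be evaluated numerically and compared with an actual perturbation. The following Python sketch is illustrative; the function name `condition_number`, the sample points 7.0 and 2.0001, and the perturbation size 1e-8 are all assumptions made for the demonstration.

```python
import math

def condition_number(f, df, x):
    """Relative condition number |x f'(x) / f(x)| of f at x."""
    return abs(x * df(x) / f(x))

# f(x) = sqrt(x): the condition number is 1/2 for every x > 0.
k_sqrt = condition_number(math.sqrt, lambda x: 0.5 / math.sqrt(x), 7.0)

# f(x) = sqrt(x - 2): the condition number x / (2 (x - 2)) blows up near x = 2.
g = lambda x: math.sqrt(x - 2.0)
dg = lambda x: 0.5 / math.sqrt(x - 2.0)
k_near_2 = condition_number(g, dg, 2.0001)

# Perturb x by a relative amount 1e-8 and compare relative changes in g.
x, rel_dx = 2.0001, 1e-8
rel_dg = abs(g(x * (1 + rel_dx)) - g(x)) / g(x)
print(k_sqrt, k_near_2, rel_dg / rel_dx)
```

Near x = 2 the observed amplification of the relative perturbation is close to the computed condition number of about 10^4, while for the plain square root it is only 1/2.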


REMARK 1.9 If $x = f(x) = 0$, then the condition number is simply
$|f'(x)|$. If $x = 0$, $f(x) \ne 0$ (or $f(x) = 0$, $x \ne 0$) then it is more useful
to consider the relation between absolute errors than relative errors. The
condition number then becomes $|f'(x)/f(x)|$.


REMARK 1.10 Generally, if a numerical approximation $\tilde{z}$ to a quantity
$z$ is computed, the relative error is related to the number of digits after the
decimal point that are correct. For example if $z = 0.0000123453$ and $\tilde{z} =
0.00001234543$, we say that $\tilde{z}$ is correct to 5 significant digits. Expressing
$z$ as $0.123453 \times 10^{-4}$ and $\tilde{z}$ as $0.123454 \times 10^{-4}$, we see that if we round $\tilde{z}$
to the nearest number with five digits in its mantissa, all of those digits are
correct, whereas, if we do the same with six digits, the sixth digit is not
correct. Significant digits is the more logical way to talk about accuracy in
a floating point computation where we are interested in relative error, rather
than “number of digits after the decimal point,” which can have a different
meaning. (Here, one might say that $\tilde{z}$ is correct to 9 digits after the decimal
point.)


1.2.2 Practicalities and the IEEE Floating Point Standard 


Prior to 1985, different machines used different word lengths and different
bases, and different machines rounded, chopped, or did something else to form
the internal representation $fl(x)$ for real numbers $x$. For example, IBM
mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal
digits total (for the base, sign, and exponent) in “single precision” numbers
and 16 hexadecimal digits total in “double precision” numbers. Machines such
as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36
binary digits (or “bits”) total in single precision numbers and 72 binary digits
total in double precision numbers. An unusual machine designed at Moscow
State University from 1955-1965, the “Setun,” even used base-3 ($\beta = 3$, or
“ternary”) numbers. Some computers had 32 bits total in single precision
numbers and 64 bits total in double precision numbers, while some
“supercomputers” (such as the Cray-1) had 64 bits total in single precision numbers
and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas
Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$)
arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the
mantissa.

Except for the Setun (the value of whose ternary digits corresponded to
“positive,” “negative,” and “neutral” in circuit elements or switches), digital
computers are mostly based on binary switches or circuit elements (that is,
“on” or “off”), so the base $\beta$ is usually 2 or a power of 2. For example, the
IBM hexadecimal digit could be viewed as a group of 4 binary digits³.

Older floating point implementations did not even always fit exactly into
the model we have previously described. For example, if $x$ is a number in the
system, then $-x$ may not have been a number in the system, or, if $x$ were a
number in the system, then $1/x$ may have been too large to be representable
in the system.

To promote predictability, portability, reliability, and rigorous error
bounding in floating point computations, the Institute of Electrical and Electronics
Engineers (IEEE) and American National Standards Institute (ANSI)
published a standard for binary floating point arithmetic in 1985: IEEE/ANSI
754-1985: Standard for Binary Floating Point Arithmetic, often referenced as
“IEEE-754,” or simply “the IEEE standard⁴.” Almost all computers in
existence today, including personal computers and workstations based on Intel,
AMD, Motorola, etc. chips, implement most of the IEEE standard.

³ An exception is in some systems for business calculations, where base 10 is implemented.
⁴ An update to the 1985 standard was made in 2008. This update gives clarifications of
certain ambiguous points, provides certain extensions, and specifies a standard for decimal
arithmetic.

In this standard, $\beta = 2$, 32 bits total are used in a single precision number
(an “IEEE single”), and 64 bits total are used for a double precision number
(“IEEE double”). In a single precision number, 1 bit is used for the sign, 8 bits
are used for the exponent, and $t = 23$ bits are used for the mantissa. In double
precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent,
and 52 bits are used for the mantissa. Thus, for single precision numbers,
the exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted
from this, to get an exponent between $-127$ and 128. In IEEE numbers,
the minimum and maximum exponent are used to denote special symbols
(such as infinity and “unnormalized” numbers), so the exponent in single
precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The
mantissa for single precision numbers represents numbers between $2^0 = 1$
and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision
numbers is, effectively, between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the
mantissa for double precision numbers represents numbers between $2^0 = 1$
and $\sum_{i=0}^{52} 2^{-i} = 2(1 - 2^{-53}) \approx 2$.


TABLE 1.1: Parameters for IEEE arithmetic

precision | β | L     | U    | t
single    | 2 | -126  | 127  | 23
double    | 2 | -1022 | 1023 | 52


In many numerical computations, such as solving the large linear systems 
arising from partial differential equation models, more digits or a larger ex- 
ponent range is required than is available with IEEE single precision. For 
this reason, many numerical analysts at present have adopted IEEE double 
precision as the default precision. For example, underlying computations in 
the popular computational environment MATLAB are done in IEEE double 
precision. 

IEEE arithmetic provides four ways of defining fl(x), that is, four “rounding
modes”: “round down,” “round up,” “round to nearest,” and “round to zero.”
These are specified as follows.


round down: $fl(x) = x{\downarrow}$, the nearest machine number to the real number
x that is less than or equal to x.


round up: $fl(x) = x{\uparrow}$, the nearest machine number to the real number x
that is greater than or equal to x.


round to nearest: fl(x) is the nearest machine number to the real number
x.


round to zero: fl(x) is the nearest machine number to the real number x
that is closer to 0 than x. This corresponds to “chop” as explained on
page 11.


The four elementary operations $+$, $-$, $\times$, and $/$ must be such that $fl(x \odot y)$
is implemented for all four rounding modes, for $\odot \in \{-, +, \times, /, \sqrt{\cdot}\}$.

The default mode (if the rounding mode is not explicitly set) is normally 
“round to nearest,” to give an approximation after a long string of compu- 
tations that is hopefully near the exact value. If the mode is set to “round 
down” and a string of computations is done, then the result is less than or
equal to the exact result. Similarly, if the mode is set to “round up,” then
the result of a string of computations is greater than or equal to the exact 
result. In this way, mathematically rigorous bounds on an exact result can be 
obtained. (This technique must be used astutely, since naive use could result 
in bounds that are too large to be meaningful.) 
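
The bracketing idea can be tried out with Python's standard `decimal` module, whose arithmetic contexts support directed rounding. The following is only a sketch under stated assumptions: it works in 4-digit decimal arithmetic rather than IEEE binary arithmetic, and the helper name `harmonic_bounds` is hypothetical.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING
from fractions import Fraction

# Two 4-digit decimal contexts: one rounds every result down, the other up.
DOWN = Context(prec=4, rounding=ROUND_FLOOR)
UP = Context(prec=4, rounding=ROUND_CEILING)

def harmonic_bounds(n):
    """Return (lower, upper) bounds on 1 + 1/2 + ... + 1/n."""
    lower = Decimal(0)
    upper = Decimal(0)
    for k in range(1, n + 1):
        # All terms are positive, so rounding each step down (up)
        # can only decrease (increase) the running sum.
        lower = DOWN.add(lower, DOWN.divide(Decimal(1), Decimal(k)))
        upper = UP.add(upper, UP.divide(Decimal(1), Decimal(k)))
    return lower, upper

lower, upper = harmonic_bounds(50)
exact = sum(Fraction(1, k) for k in range(1, 51))   # exact rational value
print(lower, upper)
print(Fraction(lower) <= exact <= Fraction(upper))
```

The gap between the two bounds grows with the number of operations, which is the overestimation the parenthetical remark warns about.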

Several parameters more directly related to numerical computations than 
L, U, and t are associated with any floating point number system. These are 


HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to
1, gives something other than 1 when using the “round to nearest” rounding
mode.


These so-called “machine constants” appear in Table 1.2 for the IEEE single 
and IEEE double precision number systems. 


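TABLE 1.2: Machine constants for IEEE arithmetic

Precision | HUGE | TINY | $\epsilon_m$
single | $\approx 2^{128} \approx 3.40 \cdot 10^{38}$ | $2^{-126} \approx 1.18 \cdot 10^{-38}$ | $2^{-23} \approx 1.19 \cdot 10^{-7}$
double | $\approx 2^{1024} \approx 1.79 \cdot 10^{308}$ | $2^{-1022} \approx 2.23 \cdot 10^{-308}$ | $2^{-52} \approx 2.22 \cdot 10^{-16}$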
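
As an illustration, the double precision constants can be inspected in Python, assuming (as is normally the case) that the Python `float` type is an IEEE double. Note that `sys.float_info.epsilon` is the gap between 1 and the next larger double, which is one common convention for machine epsilon.

```python
import sys

huge = sys.float_info.max      # largest finite double
tiny = sys.float_info.min      # smallest positive normalized double
eps = sys.float_info.epsilon   # gap between 1.0 and the next double

print(huge, tiny, eps)
print(1.0 + eps > 1.0)         # adding eps changes 1.0
print(1.0 + eps / 2 == 1.0)    # eps/2 is a tie and rounds back to 1.0
print(1.0 / tiny < huge)       # 1/TINY is representable
print(0.0 < 1.0 / huge < tiny) # 1/HUGE falls below TINY (denormalized)
```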


For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up 
the question of what happens when the result of a computation has absolute 
value less than the smallest number representable in the system, or has abso- 
lute value greater than the largest number representable in the system. In the 
first case, an underflow occurs, while, in the second case, an overflow occurs. 
In floating point computations, it is usually (but not always) reasonable to 
replace the result of an underflow by 0, but it is usually more problematical 
when an overflow occurs. Many systems prior to the IEEE standard replaced 
an underflow by 0 but stopped when an overflow occurred. 

The IEEE standard specifies representations for special numbers $\infty$, $-\infty$,
+0, −0, and NaN, where the latter represents “not a number.” The standard
specifies that computations do not stop when an overflow or underflow occurs,
or when quantities such as $\sqrt{-1}$, 1/0, −1/0, etc. are encountered (although
many programming languages by default or optionally do stop). For example,
the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN,
and computation continues. The standard also specifies “gradual underflow,”
that is, setting the result to a “denormalized” number, or a number in the
floating point format whose first digit in the mantissa is equal to 0. Computation
rules for these special numbers, such as NaN × any number = NaN,
$\infty$ × any positive normalized number = $\infty$, allow such “nonstop” arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical
analyst should be aware of it and be cautious in interpreting results.
In particular, algorithms may not behave as expected if many intermediate
results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized
numbers are used. In fact, many programming languages, by default
or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE
nonstop arithmetic with an option.
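
The following sketch shows both behaviors in Python (CPython, assuming IEEE double precision floats). Multiplication and subtraction continue with special values, while division by zero and `math.sqrt` of a negative number stop with an exception, which is a choice made by the language rather than by the standard.

```python
import math
import sys

inf = float("inf")
overflowed = sys.float_info.max * 2.0    # overflow is set to infinity
not_a_number = inf - inf                 # undefined result is set to NaN
subnormal = sys.float_info.min / 4.0     # gradual underflow: denormalized number
flushed = 5e-324 / 2.0                   # below the smallest denormalized number

print(overflowed, not_a_number, subnormal, flushed)
print(math.isnan(not_a_number * 3.0))    # NaN propagates through later operations
print(inf * 2.0 == inf)                  # infinity times a positive number

# Here the language stops instead of returning a special value.
try:
    1.0 / 0.0
except ZeroDivisionError as err:
    print("stopped:", err)
try:
    math.sqrt(-1.0)
except ValueError as err:
    print("stopped:", err)
```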


Example 1.16 

(Illustration of underflow and overflow) Suppose, for the purposes of illustration,
we have a system with β = 10, t = 2 and one digit in the exponent, so
that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$,
and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then
both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating
point number in the system to N is $0.14 \times 10^7$, well within range. However,
$x_1^2 = 10^{12}$, larger than the maximum floating point number in the system.
In older systems, an overflow usually would result in stopping the computation,
while in IEEE arithmetic, the result would be assigned the symbol
“Infinity.” The result of adding “Infinity” to “Infinity” then taking the square
root would be “Infinity,” so that N would be assigned “Infinity.” Similarly,
if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable
machine number, causing an “underflow.” On older systems, the result is usually
set to 0. On IEEE systems, if “gradual underflow” is switched on, the
result either becomes a denormalized number, with less than full accuracy,
or is set to 0; without gradual underflow on IEEE systems, the result is set
to 0. When the result is set to 0, a value of 0 is stored in N, whereas the
closest floating point number in the system is $0.14 \times 10^{-5}$, well within range.
To avoid this type of catastrophic underflow and overflow in the computation
of N, we may use the following scheme.


1. $s \leftarrow \max\{|x_1|, |x_2|\}$.


2. $\eta_1 \leftarrow x_1/s$; $\eta_2 \leftarrow x_2/s$.


3. $N \leftarrow s\sqrt{\eta_1^2 + \eta_2^2}$.
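
The three steps can be sketched in Python as follows, using IEEE double precision in place of the toy decimal system. The function name `scaled_norm` and the explicit guard for s = 0 are assumptions added for the illustration.

```python
import math

def scaled_norm(x1, x2):
    """Compute sqrt(x1**2 + x2**2) by the three-step scaling scheme."""
    s = max(abs(x1), abs(x2))                 # step 1
    if s == 0.0:
        return 0.0                            # both inputs are zero; avoid 0/0
    n1, n2 = x1 / s, x2 / s                   # step 2: both ratios lie in [-1, 1]
    return s * math.sqrt(n1 * n1 + n2 * n2)   # step 3

big, small = 1e200, 1e-200
print(math.sqrt(big * big + big * big))          # naive formula overflows
print(scaled_norm(big, big))                     # scaled formula stays in range
print(math.sqrt(small * small + small * small))  # naive formula underflows to 0
print(scaled_norm(small, small))                 # scaled formula stays in range
```

Python's standard library also provides `math.hypot`, which serves the same purpose.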


1.2.2.1 Input and Output 


For examining the output of large numerical computations arising from
mathematical models, plots, graphs, and movies composed of such plots and
graphs are often preferred over tables of values. However, to develop such
models and study numerical algorithms, it is necessary to examine individual
numbers. Because humans are trained to comprehend decimal numbers more
easily than binary numbers, the binary format used in the machine is usually
converted to a decimal format for display or printing. In many programming
languages and environments (such as all versions of Fortran, C, C++, and
in MATLAB), the format is of a form similar to $\pm d_1.d_2 d_3 \ldots d_m \mathrm{e}{\pm}\delta_1\delta_2\delta_3$, or
$\pm d_1.d_2 d_3 \ldots d_m \mathrm{E}{\pm}\delta_1\delta_2\delta_3$, where the “e” or “E” denotes the “exponent” of
10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are
usually also input either in a standard decimal form (such as 0.001) or in
this exponential format (such as 1.0e-3). (This notation originates from
the earliest computers, where the only output was a printer, and the printer
could only print numerical digits and the 26 upper case letters in the Roman
alphabet.)


Thus, for input, a decimal fraction needs to be converted to a binary floating
point number, while, for output, a binary floating point number needs
to be converted to a decimal fraction. This conversion necessarily is inexact.
For example, the exact decimal fraction 0.1 converts to the infinitely repeating
binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary
floating point system. The IEEE 754 standard specifies that the result of a
decimal to binary conversion, within a specified range of input formats, be
the nearest floating point number to the exact result,
and that, within a specified range of formats, a binary to decimal conversion
be the nearest number in the specified format (which depends on the number
m of decimal digits requested to be printed).
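
A short Python sketch makes the conversion error visible, assuming Python's `float` is an IEEE double. `Decimal(x)` and `Fraction(x)` convert the stored binary value exactly, so they show what the machine actually holds rather than what was typed.

```python
from decimal import Decimal
from fractions import Fraction

x = 0.1                                  # decimal literal, rounded to a double
print(Decimal(x))                        # exact decimal expansion of the stored value
print(Fraction(x) == Fraction(1, 10))    # the stored value is not exactly 1/10
print(f"{x:.3e}")                        # exponential output, rounded for display
print(f"{x:.20f}")                       # more digits reveal the conversion error
print(0.1 + 0.2 == 0.3)                  # each literal was rounded on input
```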


Thus, the number that one sees as output is usually not exactly the num- 
ber that is represented in the computer. Furthermore, while the floating 
point operations on binary numbers are usually implemented in hardware or 
“firmware” independently of the software system, the decimal to binary and 
binary to decimal conversions are usually implemented separately as part of 
the programming language (such as Fortran, C, C++, Java, etc.) or software 
system (such as MATLAB). The individual standards for these languages, if 
there are any, may not specify accuracy for such conversions, and the lan- 
guages sometimes do not conform to the IEEE standard. That is, the number 
that one sees printed may not even be the closest number in that format to 
the actual number. 


This inexactness in conversion usually does not cause a problem, but may 
cause much confusion in certain instances. In those instances (such as in 
“debugging,” or finding programming blunders), one may need to examine 
the binary numbers directly. One way of doing this is in an “octal,” or base-8 
format, in which each digit (between 0 and 7) is interpreted as a group of 
three binary digits, or in hexadecimal format (where the digits are 0-9, A, B, 
C, D, E, F), in which each digit corresponds to a group of four binary digits. 
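
For instance, in Python (assuming IEEE double precision floats) the stored bits of a number can be displayed in hexadecimal or octal with the standard library alone. This is only one possible way of examining them.

```python
import struct

x = 0.1
print(x.hex())                          # hexadecimal mantissa, binary exponent

# Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
bits = struct.unpack(">Q", struct.pack(">d", x))[0]
print(f"{bits:016x}")                   # hexadecimal: 4 binary digits per digit
print(f"{bits:o}")                      # octal: 3 binary digits per digit
print(float.fromhex(x.hex()) == x)      # the hexadecimal form round-trips exactly
```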
1.2.2.2 Standard Functions


To enable accurate computation of elementary functions such as sin, cos,
and exp, IEEE 754 specifies that a “long” 80-bit register (with “guard digits”)
be available for intermediate computations. Furthermore, IEEE 754-2008, an
official update to IEEE 754-1985, provides a list of functions it recommends be
implemented, and specifies accuracy requirements (in terms of correct rounding)
for those functions a programming language elects to implement.


REMARK 1.11 Alternative number systems, such as variable precision
arithmetic, multiple precision arithmetic, rational arithmetic, and combinations
of approximate and symbolic arithmetic have been investigated and
implemented. These have various advantages over the traditional floating
point arithmetic we have been discussing, but also have disadvantages, and
usually require more time, more circuitry, or both. Eventually, with the advance
of computer hardware and better understanding of these alternative
systems, their use may become more widespread. However, for the foreseeable
future, traditional floating point number systems will be the primary tool in
numerical computations.


1.3. Interval Computations 


Interval computations are useful for two main purposes: 


• to use floating point computations to compute mathematically rigorous
bounds on an exact result (and hence to rigorously bound roundoff
error);


• to use floating point computations to compute mathematically rigorous
bounds on the ranges of functions over boxes.


In complicated traditional floating point algorithms, naive arrangement of in- 
terval computations usually gives bounds that are too wide to be of practical 
use. For this reason, interval computations have been ignored by many. How- 
ever, used cleverly and where appropriate, interval computations are powerful, 
and provide rigor and validation when other techniques cannot. 

Interval computations are based on interval arithmetic. 


1.3.1 Interval Arithmetic 


In interval arithmetic, we define operations on intervals, which can be considered
as ordered pairs of real numbers. We can think of each interval as
representing the range of possible values of a quantity. The result of an
operation is then an interval that represents the range of all possible results
of the operation, as the first argument ranges over
all points in the first interval and the second argument ranges over all values
in the second interval. To state this symbolically, let $\mathbf{x} = [\underline{x}, \overline{x}]$ and $\mathbf{y} = [\underline{y}, \overline{y}]$,
and define the four elementary operations by


$$\mathbf{x} \odot \mathbf{y} = \{x \odot y \mid x \in \mathbf{x} \text{ and } y \in \mathbf{y}\} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.6}$$


Interval arithmetic’s usefulness derives from the fact that the mathematical 
characterization in Equation (1.6) is equivalent to the following operational 
definitions. 


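$$\begin{aligned}
\mathbf{x} + \mathbf{y} &= [\underline{x} + \underline{y},\; \overline{x} + \overline{y}], \\
\mathbf{x} - \mathbf{y} &= [\underline{x} - \overline{y},\; \overline{x} - \underline{y}], \\
\mathbf{x} \times \mathbf{y} &= [\min\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\},\; \max\{\underline{x}\,\underline{y}, \underline{x}\,\overline{y}, \overline{x}\,\underline{y}, \overline{x}\,\overline{y}\}], \\
\frac{1}{\mathbf{x}} &= \left[\frac{1}{\overline{x}},\; \frac{1}{\underline{x}}\right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\mathbf{x} \div \mathbf{y} &= \mathbf{x} \times \frac{1}{\mathbf{y}}.
\end{aligned} \tag{1.7}$$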

The ranges of the four elementary interval arithmetic operations are ex- 
actly the ranges of the corresponding real operations, but, if such operations 
are composed, bounds on the ranges of real functions can be obtained. For 
example, if 


$$f(x) = (x + 1)(x - 1), \tag{1.8}$$

then


$$f([-2, 2]) = ([-2, 2] + 1)([-2, 2] - 1) = [-1, 3][-3, 1] = [-9, 3],$$


which contains the exact range [−1, 3].
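
The operational definitions translate almost directly into code. The class below is a minimal sketch only: the name `Interval` and its helper `_as_interval` are hypothetical, and the end points are computed in ordinary round-to-nearest arithmetic, so it does not yet give rigorous enclosures in the presence of roundoff.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        other = _as_interval(other)
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        other = _as_interval(other)
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        other = _as_interval(other)
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))

    def reciprocal(self):
        if self.lo <= 0.0 <= self.hi:
            raise ZeroDivisionError("interval contains 0")
        return Interval(1.0 / self.hi, 1.0 / self.lo)

    def __truediv__(self, other):
        return self * _as_interval(other).reciprocal()

def _as_interval(value):
    """Treat a plain number as the degenerate interval [value, value]."""
    return value if isinstance(value, Interval) else Interval(value, value)

x = Interval(-2.0, 2.0)
print((x + 1) * (x - 1))     # Interval(lo=-9.0, hi=3.0)
```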


REMARK 1.12 In some definitions of interval arithmetic, division by
intervals containing 0 is defined, consistent with (1.6). For example,


$$\frac{[1, 2]}{[-3, 4]} = \left[-\infty, -\frac{1}{3}\right] \cup \left[\frac{1}{4}, \infty\right] = \mathbb{R}^* \setminus \left(-\frac{1}{3}, \frac{1}{4}\right),$$


where $\mathbb{R}^*$ is the extended real number system,⁵ consisting of the real numbers
with the two additional numbers $-\infty$ and $\infty$. This extended interval arithmetic⁶
was originally invented by William Kahan⁷ for computations with continued
fractions, but has wider use than that. Although a closed system can
be defined for the sets arising from this extended arithmetic, typically, the
complements of intervals (i.e., the unions of two semi-infinite intervals) are
immediately intersected with intervals, to obtain zero, one, or two intervals.
Interval arithmetic can then proceed using (1.7).


5 also known as the two-point compactification of the real numbers


The power of interval arithmetic lies in its implementation on computers. In 
particular, outwardly rounded interval arithmetic allows rigorous enclosures 
for the ranges of operations and functions. This makes a qualitative difference 
in scientific computations, since the results are now intervals in which the 
exact result must lie. It also enables use of floating point computations for 
automated theorem proving. 

Outward rounding can be implemented on any machine that has downward
rounding and upward rounding, such as any machine that complies with the
IEEE 754 standard. For example, take $\mathbf{x} + \mathbf{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$
is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward
rounding, then the resulting interval $\mathbf{z} = [\underline{z}, \overline{z}]$ that is represented in the
machine must contain the exact range of x + y for $x \in \mathbf{x}$ and $y \in \mathbf{y}$. We call
the expansion of the interval from rounding the lower end point down and the
upper end point up roundout error.
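
Python offers no portable way to switch the IEEE rounding mode, so the sketch below substitutes a different technique: it takes the round-to-nearest sum and moves each end point one machine number outward with `math.nextafter` (Python 3.9 or later). The result is slightly wider than true directed rounding would give, but it still encloses the exact sum. The function name `add_outward` is hypothetical.

```python
import math
from fractions import Fraction

def add_outward(x, y):
    """Add intervals given as (lo, hi) tuples, rounding end points outward."""
    # The round-to-nearest sum is within half a unit in the last place of
    # the exact sum, so one step outward is enough to enclose it.
    lo = math.nextafter(x[0] + y[0], -math.inf)
    hi = math.nextafter(x[1] + y[1], math.inf)
    return (lo, hi)

tenth = (0.1, 0.1)           # the double nearest to 1/10, as a point interval
total = (0.0, 0.0)
for _ in range(10):
    total = add_outward(total, tenth)

exact = 10 * Fraction(0.1)   # exact sum of the ten stored values
print(total)
print(Fraction(total[0]) <= exact <= Fraction(total[1]))
```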

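Interval arithmetic is only subdistributive. That is, if $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ are
intervals, then


$$\mathbf{x}(\mathbf{y} + \mathbf{z}) \subseteq \mathbf{x}\mathbf{y} + \mathbf{x}\mathbf{z}, \text{ but } \mathbf{x}(\mathbf{y} + \mathbf{z}) \neq \mathbf{x}\mathbf{y} + \mathbf{x}\mathbf{z} \text{ in general.} \tag{1.9}$$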


As a result, algebraic expressions that would be equivalent if real values are
substituted for the variables are not equivalent if interval values are used. For
example, suppose that, instead of writing (x − 1)(x + 1) for f(x) as in (1.8), we
write


$$f(x) = x^2 - 1, \tag{1.10}$$


and suppose we provide a routine that computes an enclosure for the range
of $x^2$ that is the exact range to within roundoff error. Such a routine could
be as follows:


ALGORITHM 1.1
(Computing an interval whose end points are machine numbers and which
encloses the range of $x^2$.)


6 There are small differences in current definitions of extended interval arithmetic. For
example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive
symbols. In those systems, $[1, 2]/[-3, 4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See
[71] for a theoretical analysis of extended arithmetic.

7 who also was a major contributor to the IEEE 754 standard
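INPUT: $\mathbf{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over
$\mathbf{x}$.


IF $\underline{x} \geq 0$ THEN

RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and
$\overline{x}^2$ is computed with upward rounding.

ELSE IF $\overline{x} \leq 0$ THEN

RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and
$\underline{x}^2$ is computed with upward rounding.

ELSE

1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is,
compute $\underline{(\underline{x}^2)}$ and $\overline{(\underline{x}^2)}$ such that $\underline{(\underline{x}^2)}$ and $\overline{(\underline{x}^2)}$ are machine representable numbers
and $\underline{x}^2 \in [\underline{(\underline{x}^2)}, \overline{(\underline{x}^2)}]$, and compute $\underline{(\overline{x}^2)}$ and $\overline{(\overline{x}^2)}$ such that $\underline{(\overline{x}^2)}$ and $\overline{(\overline{x}^2)}$ are
machine representable numbers and $\overline{x}^2 \in [\underline{(\overline{x}^2)}, \overline{(\overline{x}^2)}]$.

2. RETURN $[0, \max\{\overline{(\underline{x}^2)}, \overline{(\overline{x}^2)}\}]$.

END IF


END ALGORITHM 1.1.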


With Algorithm 1.1 and rewriting f(x) from (1.8) as in (1.10), we obtain


$$f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3],$$


which, in this case, is equal to the exact range of f over [−2, 2].
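
Algorithm 1.1 can be sketched in Python as follows. As an assumption of this sketch, directed rounding is imitated by stepping one machine number outward from the round-to-nearest result with `math.nextafter` (Python 3.9 or later), and the helper names `down`, `up`, `sqr`, and `sub_point` are hypothetical. The computed enclosure is therefore marginally wider than [−1, 3].

```python
import math

def down(v):
    """Next machine number below v (stands in for downward rounding)."""
    return math.nextafter(v, -math.inf)

def up(v):
    """Next machine number above v (stands in for upward rounding)."""
    return math.nextafter(v, math.inf)

def sqr(x):
    """Enclose the range of t**2 for t in x = (lo, hi), as in Algorithm 1.1."""
    lo, hi = x
    if lo >= 0.0:
        # A square is never negative, so the lower bound is clamped at 0.
        return (max(0.0, down(lo * lo)), up(hi * hi))
    if hi <= 0.0:
        return (max(0.0, down(hi * hi)), up(lo * lo))
    # The interval straddles 0, so the minimum of t**2 is exactly 0.
    return (0.0, max(up(lo * lo), up(hi * hi)))

def sub_point(x, c):
    """Subtract the number c from the interval x with outward rounding."""
    return (down(x[0] - c), up(x[1] - c))

enclosure = sub_point(sqr((-2.0, 2.0)), 1.0)
print(enclosure)   # slightly wider than [-1, 3] because of the outward steps
```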

In fact, this illustrates a general principle: If each variable in the expression 
occurs only once, then interval arithmetic gives the exact range, to within 
roundout error. We state this formally as 


THEOREM 1.9 

(Fundamental theorem of interval arithmetic.) Suppose $f(x_1, x_2, \ldots, x_n)$ is
an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program
with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with
interval arithmetic. The algebraic expression or computer program can contain
the four elementary operations and operations such as $x^n$, sin(x), exp(x),
and log(x), etc., as long as the interval values of these functions contain their
range over the input intervals. Then


1. The interval value $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ contains the range of f over the interval
vector (or box) $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$.


2. If the single functions (the elementary operations and functions $x^n$, etc.)
have interval values that represent their exact ranges, and if each variable
$x_i$, $1 \leq i \leq n$, occurs only once in the expression for f, then the
values of f obtained by interval arithmetic represent the exact ranges of
f over the input intervals.


If the expression for f contains one or more variables more than once,
then overestimation of the range can occur due to interval dependency. For
example, when we evaluate our example function f([−2, 2]) according to (1.8),
the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the
second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3]
is the exact range of $f(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent,
$x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.

We now present some definitions and theorems to clarify the practical con- 
sequences of interval dependency. 


DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so
that each variable occurs only once is called a single use expression, or SUE.


Fortunately, we do not need to transform every expression into a single use 
expression for interval computations to be of value. In particular, the interval 
dependency decreases as the widths of the input intervals become smaller.
The following formal definition will help us to describe this precisely. 


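DEFINITION 1.7 Suppose an interval evaluation $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ gives
[a, b] as a result interval, but the exact range $\{f(x_1, \ldots, x_n) : x_i \in \mathbf{x}_i, 1 \leq i \leq n\}$
is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)$ in the interval
evaluation $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ by $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n) = (c - a) + (b - d)$.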


For example, the excess width in evaluating f(x) represented as (x + 1)(x − 1)
over $\mathbf{x} = [-2, 2]$ is (−1 − (−9)) + (3 − 3) = 8. In general, we have


THEOREM 1.10 

Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through
$x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this
expression is evaluated with interval arithmetic, as in Theorem 1.9, to obtain
an interval enclosure $f(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ to the range of f for $x_i \in \mathbf{x}_i$, $1 \leq i \leq n$.
Then, if $E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n)$ is as in Definition 1.7, we have


$$E(f; \mathbf{x}_1, \ldots, \mathbf{x}_n) = O\left(\max_{1 \leq i \leq n} w(\mathbf{x}_i)\right),$$


where $w(\mathbf{x})$ denotes the width of the interval $\mathbf{x}$.
That is, the overestimation becomes less as the uncertainty in the arguments
to the function becomes smaller. 

Interval evaluations as in Theorem 1.10 are termed first-order interval ex- 
tensions. It is not difficult to obtain second-order extensions, where required. 
(See Exercise 26 below.) 
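
The first-order behavior can be observed numerically. The sketch below evaluates the expression (x + 1)(x − 1) from (1.8) over the shrinking intervals [−h, h] and compares the result with the exact range [−1, h² − 1] of x² − 1. It ignores roundoff (no outward rounding is applied), and the helper names `mul` and `excess_width` are hypothetical.

```python
def mul(x, y):
    """Interval product of (lo, hi) tuples, as in Equation (1.7)."""
    p = (x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1])
    return (min(p), max(p))

def excess_width(h):
    """Excess width of (x + 1)(x - 1) evaluated over x = [-h, h], h > 0."""
    a, b = mul((-h + 1.0, h + 1.0), (-h - 1.0, h - 1.0))
    c, d = -1.0, h * h - 1.0        # exact range of x**2 - 1 over [-h, h]
    return (c - a) + (b - d)

for k in range(6):
    h = 2.0 ** -k
    width = 2.0 * h
    e = excess_width(h)
    print(f"w = {width:<8g} E = {e:<10g} E/w = {e / width:g}")
```

The ratio E/w stays bounded as w shrinks, which is what the O(w) estimate of Theorem 1.10 predicts for this expression.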


1.3.2 Application of Interval Arithmetic: Examples 


We give one such example here. 


Example 1.17 
Using 4-digit decimal floating point arithmetic, compute an interval enclosure
for the first two digits of e, and prove that these two digits are correct.


Solution: The fifth degree Taylor polynomial representation for e is


$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi}$$

for some $\xi \in [0, 1]$. If we assume we know e < 3 and we assume we know $e^x$
is an increasing function of x, then the error term is bounded by


$$\left| \frac{1}{6!} e^{\xi} \right| \leq \frac{3}{6!} < 0.005,$$


so this fifth-degree polynomial representation should be adequate. We will
evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with [1, 3].
We obtain the following computation:


[1.000, 1.000] + [1.000, 1.000] → [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] → [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] → [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] → [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] → [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] → [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] → [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] → [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] → [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] → [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] → [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] → [2.716, 2.723]


Since we used outward rounding in these computations, this constitutes a 
mathematical proof that e € [2.716, 2.723]. 
Note: 
1. These computations can be done automatically on a computer, as simply
as evaluating the function in floating point arithmetic. We will explain
some programming techniques for this in Chapter 6, Section 6.2.


2. The solution is illustrative. More sophisticated methods, such as argument
reduction, would be used in practice to bound values of $e^x$ more
accurately and with fewer operations.
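
As one possible illustration of the first note, the computation of Example 1.17 can be reproduced with Python's standard `decimal` module, which supports 4-digit decimal arithmetic with directed rounding. The helper names (`add`, `div_pos`, `mul_pos`, `point`) are hypothetical, and the division and multiplication helpers assume strictly positive intervals, which is all this example needs.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING

DOWN = Context(prec=4, rounding=ROUND_FLOOR)    # 4 digits, round down
UP = Context(prec=4, rounding=ROUND_CEILING)    # 4 digits, round up

def add(x, y):
    return (DOWN.add(x[0], y[0]), UP.add(x[1], y[1]))

def div_pos(x, y):
    """Quotient of two intervals with strictly positive end points."""
    return (DOWN.divide(x[0], y[1]), UP.divide(x[1], y[0]))

def mul_pos(x, y):
    """Product of two intervals with strictly positive end points."""
    return (DOWN.multiply(x[0], y[0]), UP.multiply(x[1], y[1]))

def point(v):
    """Degenerate interval [v, v] for an exactly representable integer v."""
    d = Decimal(v)
    return (d, d)

total = point(1)
factorial = 1
for k in range(1, 6):                 # terms 1/1!, 1/2!, ..., 1/5!
    factorial *= k
    total = add(total, div_pos(point(1), point(factorial)))

# Remainder term e^xi / 6! with e^xi replaced by [1, 3].
remainder = mul_pos(div_pos(point(1), point(720)), (Decimal(1), Decimal(3)))
enclosure = add(total, remainder)
print(enclosure)   # (Decimal('2.716'), Decimal('2.723'))
```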


Proofs of the theorems, as well as greater detail, appear in various texts 
on interval arithmetic. A good book on interval arithmetic is R. E. Moore’s 
classic text [58] although numerous more recent monographs and reviews are 
available. A World Wide Web search on the term “interval computations” 
will lead to some of these. 

A general introduction to interval computations is [57]. That work gives 
not only a complete introduction, with numerous examples and explanation 
of pitfalls, but also provides examples with INTLAB, a free MATLAB toolbox for 
interval computations, and reference material for INTLAB. If you have MAT- 
LAB available, we recommend INTLAB for the exercises in this book involving 
interval computations. 


1.4 Exercises 
1. Answer the following:


a) Prove that $|e^x - e^y| \leq |x - y|$ for all $x, y \leq 0$.

b) Show that


$$p\,y^{p-1}(x - y) \leq x^p - y^p \leq p\,x^{p-1}(x - y),$$


for $0 \leq y \leq x$ and $p \geq 1$.


c) Assume f(x) is continuous for $a \leq x \leq b$ and let

$$S = \sum_{j=1}^{n} w_j f(x_j),$$

with $x_j \in [a, b]$, $w_j \geq 0$ for j = 1, ..., n, and $\sum_{j=1}^{n} w_j = 1$. Show
that $S = f(\xi)$ for some $\xi \in [a, b]$.


d) Let $f' \in C[a, b]$ and $f'(x) \neq 0$ for all x in (a, b). Determine at how
many points the function f(x) can possibly vanish in [a, b]. Explain
your answer.
2. Write down a polynomial p(x) such that $|\mathrm{sinc}(x) - p(x)| \leq 10^{-10}$ for
$-0.2 \leq x \leq 0.2$, where


$$\mathrm{sinc}(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \neq 0, \\ 1 & \text{if } x = 0 \end{cases}$$


is the “sinc” function (well-known in signal processing, etc.). Prove
that your polynomial p satisfies the condition $|\mathrm{sinc}(x) - p(x)| \leq 10^{-10}$
for x ∈ [−0.2, 0.2].

Hint: You can obtain polynomial approximations with error terms for
sinc(x) by writing down Taylor polynomials and corresponding error
terms for sin(x), then dividing these by x. This can be easier than trying
to differentiate sinc(x). For the proof part, you can use, for example,
the Taylor polynomial remainder formula or the alternating series test
and you can use interval arithmetic to obtain bounds.


3. Suppose f has a continuous third derivative. Show that


$$\left| \frac{f(x + h) - f(x - h)}{2h} - f'(x) \right| = O(h^2).$$


4. Suppose f has a continuous fourth derivative. Show that


$$\left| \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - f''(x) \right| = O(h^2).$$


5. Let $x_n = e^{1/n} - 1/n - 1$ and $a_n = 1/n$. Show that $x_n = O(a_n^2)$.


6. Let $x_n = \dfrac{n + 1}{n^2 \ln n}$ and $a_n = 1/n$. Show that $x_n = o(a_n)$.


7. Let a = 0.326, b = 0.000135, and c = 0.000431. Assuming 3-digit
decimal computer arithmetic with rounding to nearest, does a + (b + c) =
(a + b) + c when using this arithmetic?


8. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer
arithmetic with rounding, show that $\dfrac{a - b}{c} \neq \dfrac{a}{c} - \dfrac{b}{c}$ when using
this arithmetic.


9. Let $x_i^*$ for i = 1, 2, 3, 4 be positive numbers on a computer. With a
unit round-off error δ, $x_i^* = x_i(1 + \epsilon_i)$ with $|\epsilon_i| \leq \delta$, where $x_i$ for i = 1,
2, 3, 4 are the exact numbers. Consider the scalar product $S_2 = a^T b$
where $a = [x_1, x_2]^T$ and $b = [x_3, x_4]^T$. Let $S_2^*$ be the floating point
approximation of $S_2$. Prove that $S_2^*/S_2 \leq e^{4\delta}$.


10. Suppose IEEE single precision with rounding to nearest is being used.
What is the maximum relative error |(x − fl(x))/x|, for
(a) x € [2-°, 2-27}?
(b) x ∈ [10000000, 10100000]?
(c) x € [2®, 282]? 


11. Repeat Exercise 10, but assume that IEEE double precision rather than 
IEEE single precision is used. 


12. Write down a formula relating the unit roundoff δ of Remark 1.5 (page 12)
and the machine epsilon $\epsilon_m$, defined on page 20.


13. Suppose, for illustration, we have a system with base β = 10, t = 3
decimal digits in the mantissa, and L = −9, U = 9 for the exponent.
For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this
system. Suppose also that “round to nearest” is used in this system.


(a) What is HUGE for this system?

(b) What is TINY for this system?

(c) What is the machine epsilon $\epsilon_m$ for this system?

(d) Let f(x) = sin(x) + 1.

i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format
for this toy system.

ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what
is the nearest machine number to the exact value of f(0.0008) −
f(0)?

iii. Compute fl(fl(f(0.0008)) − fl(f(0)))/fl(0.0008). Compare this
to the nearest machine number to the exact value of (f(0.0008) −
f(0))/0.0008 and to f′(0).


14. Consider evaluation of f(x) = log(x + 1) − log(x), for x large.


(a) Discuss the effects of roundoff error on the absolute error and relative
error in the approximation of f(x).


(b) Propose an alternate expression for evaluating f(x) for large x
that is less vulnerable to roundoff error. (Hint: The truncation
error (“method error”) may be nonzero for such an expression, but
could be negligible in relation to the unit roundoff error.)


(c) Test your predictions from part 14a and your expression from
part 14b by making some computations for large x, say, using MATLAB
(which employs double precision IEEE arithmetic), or, say, by
writing a short Fortran, C, or C++ program.


15. Let $f(x) = \dfrac{\ln(x + 1) - \ln(x)}{5}$.

(a) Use four-digit decimal arithmetic with rounding to evaluate f(1000).

(b) Rewrite f(x) in a form that avoids the loss of significant digits and
evaluate f(x) for x = 1000 once again.

(c) Compare the relative errors for the answers obtained in (a) and
(b).


16. Assume that $x^*$ and $y^*$ are approximations to x and y with relative
errors $r_x$ and $r_y$, respectively, and that $|r_x|, |r_y| < R$. Assume further
that $x \neq y$. How small must R be in order to ensure that $x^* \neq y^*$?


17. Let $x^*$ and $y^*$ be the floating point representations of x and y, respectively.
Let $f(x^*, y^*)$ be the approximate value of f(x, y). Derive the
relation between the relative error in evaluating the function f(x, y)
in terms of the relative error in evaluating x and the relative error in
evaluating y.


18. The formula for the net capacitance when two capacitors of values x
and y are connected in series is


$$z = \frac{xy}{x + y}.$$


Suppose the measured values of x and y are $x^* = 1$ and $y^* = 1$, respectively.
Estimate the relative error in the function evaluation of


$$z^* = \frac{x^* y^*}{x^* + y^*}$$


given $|x - x^*| = 1$ and $|y - y^*| = 1$.


19. Show that the condition number of the product f(x) · g(x) of functions
satisfies $K_{fg}(x) \leq K_f(x) + K_g(x)$.


20. Compute the condition number of $f(x) = e^{\sqrt{x^2 - 1}}$, x > 1, and discuss
any possible ill-conditioning.


21. Let $f(x) = x^{1/10}$ and let $x^*$ approximate x correctly to k significant
decimal digits (with rounding). Prove that $f(x^*)$ approximates f(x)
correctly to (k + 1) significant decimal digits.


22. The function $f(x) = e^{x/2}$ is to be evaluated for any x ($0 \leq x \leq 25$)
correct to 5 significant decimal digits. What digit decimal rounding
arithmetic should be used (i.e., in x) to get the required accuracy in
evaluating the function f(x)?


23. Define $I_n = \int_0^1 \frac{t^n}{t + 1}\,dt$. Show that $I_0 = \ln(2)$ and that $I_n = \frac{1}{n} - I_{n-1}$
for $n \geq 1$. Describe the efficiency of computing $I_j$ for j = 1, ..., 10 using
the recurrence formula just obtained and estimate the accuracy
of $I_{10}$. Explain your observation.


24. Use the Taylor polynomial for arctan(x) centered at $x_0 = 0$ and interval
arithmetic to compute mathematically rigorous bounds $\underline{\pi}$ and $\overline{\pi}$ on $\pi$,
such that $\overline{\pi} - \underline{\pi}$ < 107°.

Hint: You may use the interval arithmetic in the computer algebra systems
Mathematica or Maple, or you may use the interval arithmetic in
the INTLAB toolbox for MATLAB. An introduction to INTLAB, along with
reference material for it, can be found in [57].


25. Let $f(x) = (\sin(x))^2 + x/2$. Use interval arithmetic to prove that there
are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


26. Let $f(x) = x^2 - x$. One way of obtaining a bound on the range of f is
by evaluating f directly, using interval arithmetic, while another way is
by using the mean value theorem to obtain


$$f_{mv}(\mathbf{x}) = f(\check{x}) + f'(\mathbf{x})(\mathbf{x} - \check{x}),$$


where $\check{x} \in \mathbf{x}$ is an approximation to the midpoint of $\mathbf{x}$ and where $f'(\mathbf{x})$
is an interval evaluation of the derivative of f over $\mathbf{x}$. (This second
bound on the range is usually called the mean value extension of f.)


(a) For $\mathbf{x}_i = [1 - \epsilon_i, 1 + \epsilon_i]$, $\epsilon_i = 1/4^i$, i = 1, 2, ..., 10, form a table
whose columns are as follows:


$E(f; \mathbf{x}_i)$ | $w(\mathbf{x}_i)$ | $E(f; \mathbf{x}_i)/w(\mathbf{x}_i)$


where $E(f; \mathbf{x}_i) = w(f(\mathbf{x}_i)) - w(f^u(\mathbf{x}_i))$ denotes the “excess width”
and where $f^u(\mathbf{x})$ denotes the exact range of f over $\mathbf{x}$.


(b) Do the same as in part 26a, except use $f_{mv}(\mathbf{x})$ in place of $f(\mathbf{x})$.


(c) Based on your results in parts 26a and 26b, give values for $\alpha_1$ and
$\alpha_2$ such that $E(f; \mathbf{x}) = O(w(\mathbf{x})^{\alpha_1})$ and $E(f_{mv}; \mathbf{x}) = O(w(\mathbf{x})^{\alpha_2})$.
(In fact, your conclusions hold in general.)


27. Repeat Problem 26, but with intervals $\mathbf{x}_i = [2 - \epsilon_i, 2 + \epsilon_i]$.


Chapter 2 


Numerical Solution of Nonlinear 
Equations of One Variable 


2.1 Introduction 


In this chapter, we study methods for finding approximate solutions to
the scalar equation f(x) = 0. Some classical examples include the equation
x − tan x = 0 that occurs in the diffraction of light, or Kepler’s equation
x − b sin x = 0 used for calculating planetary orbits. Other examples include
transcendental equations such as $f(x) = e^x + x = 0$ and algebraic equations
such as x’ + 42° — 7a? +62 +3 = 0. There are several reasons for beginning 
the study of numerical analysis with this problem: 


1. The problem occurs frequently, i.e., the methods are useful. 


2. The problem illustrates the iterative method of solution, which is a 
common numerical technique. 


3. Convergence results are easy to derive. 


4. The problem illustrates that, when many methods are available, selec- 
tion of the most suitable method depends on the particular problem. 


2.2 Bisection Method 


The bisection method is simple, reliable, and almost always can be applied, 
but is generally not as fast as other methods. Note that, if y = f(x), then
f(x) = 0 corresponds to the point where the curve y = f(x) crosses the x-axis.
The bisection method is based on the following result. 


THEOREM 2.1
Suppose that $f \in C[a, b]$ and f(a)f(b) < 0. Then there is a $z \in [a, b]$ such
that f(z) = 0.

PROOF The proof follows directly from the Intermediate Value Theorem.
That is, if $f \in C[a, b]$ and k is any number between f(a) and f(b), then there
is a $z \in [a, b]$ such that f(z) = k. Let k = 0. (See Figure 2.1.)


FIGURE 2.1: Example for a special case of the Intermediate Value Theorem
(Theorem 2.1).


The method of bisection is simple to implement as illustrated in the following 
algorithm: 


ALGORITHM 2.1

(The bisection algorithm)
INPUT: An error tolerance $\epsilon$
OUTPUT: Either a point $x$ that is within $\epsilon$ of a solution $z$ or "failure to find
a sign change"


1. Find $a$ and $b$ such that $f(a)f(b) < 0$. (By Theorem 2.1, there is a
$z \in [a,b]$ such that $f(z) = 0$.) (Return with "failure to find a sign
change" if such an interval cannot be found.)

2. Let $a_0 = a$, $b_0 = b$, $k = 0$.

3. Let $x_k = (a_k + b_k)/2$.

4. IF $f(x_k) f(a_k) > 0$ THEN
   (a) $a_{k+1} \leftarrow x_k$,
   (b) $b_{k+1} \leftarrow b_k$.
ELSE
   (a) $b_{k+1} \leftarrow x_k$,
   (b) $a_{k+1} \leftarrow a_k$.
END IF

5. IF $(b_k - a_k)/2 < \epsilon$ THEN
   Stop, since $x_k$ is within $\epsilon$ of $z$. (See the explanation below.)
ELSE
   (a) $k \leftarrow k + 1$.
   (b) Return to step 3.
END IF
END ALGORITHM 2.1.
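
The steps of the bisection algorithm can be sketched in Python as follows. This is a minimal illustration, not a production routine: the function name `bisect` and the safety cap `max_iter` are assumptions introduced here, and only two endpoints are stored and overwritten rather than the full sequences $a_k$, $b_k$.

```python
import math

def bisect(f, a, b, eps, max_iter=200):
    """Bisection in the spirit of Algorithm 2.1; returns the last midpoint x_k."""
    fa, fb = f(a), f(b)
    if not fa * fb < 0:
        raise ValueError("failure to find a sign change")
    x = (a + b) / 2
    for _ in range(max_iter):          # max_iter is a hypothetical safety cap
        x = (a + b) / 2
        half_width = (b - a) / 2       # bound on |x - z| for this midpoint
        fx = f(x)
        if fx * fa > 0:                # same sign as f(a): a zero lies in [x, b]
            a, fa = x, fx
        else:                          # otherwise a zero lies in [a, x]
            b = x
        if half_width < eps:
            break
    return x

# The function of Example 2.1 on [-1, 0].
root = bisect(lambda x: math.exp(x) + x, -1.0, 0.0, 1e-6)
print(root)
```

Reusing the stored value `fa` avoids a second evaluation of $f$ at the left endpoint in each pass.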


Basically, in the method of bisection, the interval $[a_k, b_k]$ contains $z$ and
$b_k - a_k = (b_{k-1} - a_{k-1})/2$. The interval containing $z$ is reduced by a factor
of 2 at each iteration.


Note: In practice, when programming bisection, we usually do not store the
numbers $a_k$ and $b_k$ for all $k$ as the iteration progresses. Instead, we usually
store just two numbers $a$ and $b$, replacing these by new values, as indicated
in Step 4 of our bisection algorithm (Algorithm 2.1).


FIGURE 2.2: Graph of $e^x + x$ for Example 2.1.


Example 2.1
$f(x) = e^x + x$, $f(0) = 1$, $f(-1) = -0.632$. Thus, $-1 < z < 0$. (There is
a unique zero, because $f'(x) = e^x + 1 > 0$ for all $x$.) Setting $a_0 = -1$ and
$b_0 = 0$, we obtain the following table of values.

k | a_k | b_k | x_k
0 | -1.000 | 0.000 | -0.500
1 | -1.000 | -0.500 | -0.750
2 | -0.750 | -0.500 | -0.625
3 | -0.625 | -0.500 | -0.5625
4 | -0.625 | -0.5625 | -0.59375

Thus $z \in (-0.625, -0.5625)$; see Figure 2.2.


REMARK 2.1 The method always works for $f$ continuous, as long as $a$
and $b$ can be found such that $f(a)f(b) < 0$ (and as long as we assume roundoff
error does not cause us to incorrectly evaluate the sign of $f(x)$). However,
consider $y = f(x)$ with $f(x) \ge 0$ for every $x$, but $f(z) = 0$. There are no
$a$ and $b$ such that $f(a)f(b) < 0$. Thus, the method is not applicable to all
problems in its present form. (See Figure 2.3 for an example of a root that
cannot be found by bisection.)


FIGURE 2.3: Example for Remark 2.1 (when f does not change sign in 
[a, b]). 


We now have the question: How can we estimate the error $|x_k - z|$ in the
$k$-th iteration?


THEOREM 2.2
Suppose that $f \in C[a,b]$ and $f(a)f(b) < 0$. Then

$$|x_k - z| \le \frac{b - a}{2^{k+1}}.$$

PROOF Combining, for $k \ge 1$,

$$b_k - a_k = (b - a)/2^k,$$

$z \in (a_k, b_k)$, and $x_k = (a_k + b_k)/2$ gives

$$|z - x_k| \le \tfrac{1}{2}(b_k - a_k) = (b - a)/2^{k+1}.$$


REMARK 2.2 Thus, in the algorithm, if $\tfrac{1}{2}(b_k - a_k) = (b - a)/2^{k+1} < \epsilon$,
then $|z - x_k| < \epsilon$.


Example 2.2

How many iterations are required to reduce the error to less than $10^{-6}$ if
$a = 0$ and $b = 1$?

Solution: We need $\frac{1}{2^{k+1}}(1 - 0) < 10^{-6}$. Thus, $2^{k+1} > 10^{6}$, or $k = 19$.
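
The count in Example 2.2 can be checked directly from the bound of Theorem 2.2. The helper name `bisection_steps` below is an assumption made for this sketch.

```python
def bisection_steps(a, b, eps):
    """Smallest k >= 0 with (b - a) / 2**(k + 1) < eps (bound of Theorem 2.2)."""
    k = 0
    while (b - a) / 2 ** (k + 1) >= eps:
        k += 1
    return k

steps = bisection_steps(0.0, 1.0, 1e-6)
print(steps)   # 19, since 2**20 = 1048576 > 10**6 while 2**19 < 10**6
```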


2.3 The Fixed Point Method


The method can be stated in terms of simple principles here. However, these 
principles are important both in the multidimensional analogue we describe 
in Section 8.2 (on page 442) and in infinite-dimensional analogues used in the 
mathematical analysis of differential equations. Let G be a closed subset of 
R or C. 


DEFINITION 2.1 $z \in G$ is a fixed point of $g$ if $g(z) = z$.


REMARK 2.3 If $f(x) = g(x) - x$, then a fixed point of $g$ is a zero of $f$.


The fixed-point iteration method is defined by the following: For $x_0 \in G$,

$$x_{k+1} = g(x_k) \quad \text{for } k = 0, 1, 2, \ldots$$


Example 2.3
$g(x) = \frac{1}{2}(x + 1)$, $x_0 = 0$, $x_{k+1} = \frac{1}{2}(x_k + 1)$. Then $x_0 = 0$, $x_1 = 1/2$, $x_2 = 3/4$,
$x_3 = 7/8$, $x_4 = 15/16$, ...


An important question is: when does $\{x_k\}_{k=0}^{\infty}$ converge to $z$, a fixed point
of $g$? Fixed-point iteration does not always converge. Consider $g(x) = x^2$,
whose fixed points are $x = 0$ and $x = 1$. If $x_0 = 2$, then $x_{k+1} = x_k^2$, so $x_1 = 4$,
$x_2 = 16$, $x_3 = 256$, ...
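
Both behaviors can be reproduced with a few lines of Python. The helper `fixed_point_iterates` is a name introduced for this sketch only.

```python
def fixed_point_iterates(g, x0, n):
    """Return [x_0, x_1, ..., x_n] with x_{k+1} = g(x_k)."""
    xs = [x0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs

converging = fixed_point_iterates(lambda x: 0.5 * (x + 1), 0.0, 4)
diverging = fixed_point_iterates(lambda x: x * x, 2.0, 3)
print(converging)   # [0.0, 0.5, 0.75, 0.875, 0.9375]
print(diverging)    # [2.0, 4.0, 16.0, 256.0]
```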
The following definition is useful when discussing convergence of fixed point
iteration.


DEFINITION 2.2 $g$ satisfies a Lipschitz condition on $G$ if there is a
Lipschitz constant $L \ge 0$ such that

$$|g(x) - g(y)| \le L|x - y| \quad \text{for all } x, y \in G. \tag{2.1}$$


If $g$ satisfies (2.1) with $0 \le L < 1$, $g$ is said to be a contraction on the set $G$.
We now have the following well-known result:


THEOREM 2.3

(Contraction Mapping Theorem in one variable) Suppose that $g$ maps $G$ into
itself (i.e., if $x \in G$ then $g(x) \in G$) and $g$ satisfies a Lipschitz condition with
$0 \le L < 1$ (i.e., $g$ is a contraction on $G$). Then, there is a unique $z \in G$ such
that $z = g(z)$, and the sequence determined by $x_0 \in G$, $x_{k+1} = g(x_k)$, $k = 0$,
1, 2, ... converges to $z$, with error estimates

$$|x_k - z| \le \frac{L^k}{1 - L}|x_1 - x_0|, \quad k = 1, 2, \ldots \tag{2.2}$$

$$|x_k - z| \le \frac{L}{1 - L}|x_k - x_{k-1}|, \quad k = 1, 2, \ldots \tag{2.3}$$


Before we prove the Contraction Mapping Theorem, we review the following 
concept. 


DEFINITION 2.3 A sequence $\{x_k\}_{k=0}^{\infty}$ is called a Cauchy sequence if,
given any $\epsilon > 0$, there is an $N$ dependent on $\epsilon$ such that $|x_k - x_l| < \epsilon$ for
every $k$ and $l$ greater than $N$. Cauchy sequences in $\mathbb{R}$ and $\mathbb{C}$ must converge to
a point in $\mathbb{R}$ or $\mathbb{C}$, respectively. Number systems in which Cauchy sequences
converge are called complete spaces.


PROOF (of Theorem 2.3) Note that $x_k \in G$, $k = 0, 1, 2, \ldots$, since
$g(x) \in G$ if $x \in G$. Consider

$$|x_{k+1} - x_k| = |g(x_k) - g(x_{k-1})| \le L|x_k - x_{k-1}| \le L^2|x_{k-1} - x_{k-2}| \le \cdots \le L^k|x_1 - x_0|.$$

Also,

$$|x_k - x_{k+j}| \le |x_k - x_{k+1}| + |x_{k+1} - x_{k+2}| + \cdots + |x_{k+j-1} - x_{k+j}|$$
$$\le (L + L^2 + \cdots + L^j)\,|x_k - x_{k-1}| \le \frac{L}{1 - L}|x_k - x_{k-1}|.$$

Hence,

$$|x_k - x_{k+j}| \le \frac{L}{1 - L}|x_k - x_{k-1}| \le \frac{L^k}{1 - L}|x_1 - x_0|. \tag{2.4}$$

Now, let $m = k$ and $n = k + j$. Then,

$$|x_m - x_n| \le \frac{L^m}{1 - L}|x_1 - x_0|. \tag{2.5}$$

Therefore, given $\epsilon > 0$, there is an $N$ sufficiently large such that if $m, n \ge N$,
$|x_n - x_m| < \epsilon$. (Recall that $0 \le L < 1$.) Thus, $\{x_k\}_{k=0}^{\infty}$ is a Cauchy sequence,
and converges, since $\mathbb{R}$ (or $\mathbb{C}$) is complete. However, since $x_{k+1} = g(x_k)$ and
$g$ is continuous,

$$z = \lim_{k \to \infty} x_{k+1} = \lim_{k \to \infty} g(x_k) = g(z).$$

Thus, $\{x_k\}$ converges to a fixed point $z$ of $g$.
Now consider uniqueness of $z$. Suppose that $z_1$ and $z_2$ are both fixed points,
i.e., $z_1 = g(z_1)$ and $z_2 = g(z_2)$, and $z_1 \ne z_2$. Then

$$|z_2 - z_1| = |g(z_2) - g(z_1)| \le L|z_2 - z_1| < |z_2 - z_1|.$$

This contradiction shows that the fixed point is unique. Note that to obtain
(2.2) and (2.3), just let $j \to \infty$ in (2.4).


REMARK 2.4 Observe that $|x_k - z| = |g(x_{k-1}) - g(z)| \le L|x_{k-1} - z|$.
This inequality indicates that fixed-point iteration has at least a linear rate
of convergence.


REMARK 2.5 There are two technical difficulties in applying the above 
theorem. 


1. It may be difficult to find G such that g maps G into itself. 


2. It may be difficult to show that g satisfies a Lipschitz condition on G. 




Throughout this section, we will show how to overcome these technical 
difficulties. 


PROPOSITION 2.1
Suppose that $g$ is continuously differentiable and that $|g'(x)| \le L$ for $x \in G$.
Then $g$ satisfies a Lipschitz condition with Lipschitz constant $L$ for $x \in G$.


PROOF By the Mean Value Theorem (Theorem 1.4 on page 3) followed
by application of the assumption on $|g'(x)|$, we have

$$|g(y) - g(x)| = |g'(\xi)||y - x| \le L|y - x|$$

for $x \in G$ and $y \in G$.
Example 2.4
Suppose

$$g(x) = -\frac{x^3}{6} + \frac{x^5}{120},$$

and suppose we wish to find a Lipschitz constant for $g$ over the interval
$[-1/2, 1/2]$.


We will proceed by an interval evaluation of $g'$ over $[-1/2, 1/2]$. Since
$g'(x) = -x^2/2 + x^4/24$, we have

$$g'([-1/2, 1/2]) \subseteq -\tfrac{1}{2}[-1/2, 1/2]^2 + \tfrac{1}{24}[-1/2, 1/2]^4$$
$$\subseteq [-0.125, 0] + [0, 0.002605] \subseteq [-0.125, 0.00261].$$


Thus, since $|g'(x)| \le \max_{y \in [-0.125, 0.00261]} |y| = 0.125$, $g$ satisfies a Lipschitz
condition with Lipschitz constant 0.125.
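
The interval evaluation above can be mimicked with a few helper functions. This is only a sketch under simplifying assumptions: intervals are plain `(lo, hi)` tuples, the helper names are invented here, and ordinary floating point is used without the outward rounding that a rigorous interval library would apply.

```python
def i_add(x, y):
    """Sum of two intervals given as (lo, hi) tuples."""
    return (x[0] + y[0], x[1] + y[1])

def i_scale(c, x):
    """Product of the real number c and the interval x."""
    lo, hi = c * x[0], c * x[1]
    return (min(lo, hi), max(lo, hi))

def i_even_pow(x, n):
    """Exact range of t**n over the interval x, for even n >= 2."""
    lo, hi = x
    big = max(abs(lo), abs(hi)) ** n
    if lo <= 0.0 <= hi:
        return (0.0, big)
    return (min(abs(lo), abs(hi)) ** n, big)

X = (-0.5, 0.5)
# g'(x) = -x**2/2 + x**4/24 evaluated over X
dg = i_add(i_scale(-0.5, i_even_pow(X, 2)), i_scale(1 / 24, i_even_pow(X, 4)))
lipschitz = max(abs(dg[0]), abs(dg[1]))
print(dg, lipschitz)
```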


Assume in the following that G is a closed subset of R. Two useful results 


for showing that g satisfies the two conditions of the Contraction Mapping 
Theorem (Theorem 2.3 on page 40) are the following. 


PROPOSITION 2.2
Suppose that $g$ is continuously differentiable and $|g'(x)| \le L < 1$ for $x \in G$.
Then $g$ is a contraction on $G$.


PROOF Let $x$ and $\tilde{x} \in G$. (Without loss of generality assume that $\tilde{x} > x$.)
Then

$$|g(\tilde{x}) - g(x)| = \left| \int_x^{\tilde{x}} g'(s)\,ds \right| \le \int_x^{\tilde{x}} |g'(s)|\,ds \le L|\tilde{x} - x|.$$

Thus, $g$ is a contraction on $G$.


PROPOSITION 2.3
Let $\rho > 0$ and $G = [c - \rho, c + \rho]$. Suppose that $g$ is a contraction on $G$ with
Lipschitz constant $L$, $0 \le L < 1$, and

$$|g(c) - c| \le (1 - L)\rho.$$

Then $g$ maps $G$ into itself.


PROOF Suppose that $x \in G$. (We need to show that $g(x) \in G$.) Then

$$|g(x) - c| = |g(x) - g(c) + g(c) - c| \le |g(x) - g(c)| + |g(c) - c|$$
$$\le L|x - c| + (1 - L)\rho \le L\rho + (1 - L)\rho = \rho.$$

Thus, $g(x) \in G$.


The following result is also useful. 


PROPOSITION 2.4

Assume that $z$ is a solution of $x = g(x)$, $g'(x)$ is continuous in an interval
about $z$, and $|g'(z)| < 1$. Then $g$ is a contraction in a sufficiently small
interval about $z$, and $g$ maps this interval into itself. Thus, provided $x_0$ is
picked sufficiently close to $z$, the iterates will converge.


PROOF Select $L$ such that $|g'(z)| < L < 1$. Then select an interval
$I = [z - \epsilon, z + \epsilon]$ with $\max_{x \in I} |g'(x)| \le L < 1$. We have that $g: I \to I$ since,
if $x \in I$,

$$|z - g(x)| = |g(z) - g(x)| = |g'(\xi)||z - x| \le L|z - x| < \epsilon.$$

The contraction mapping theorem can then be used with $G = I$.
Example 2.5
Let

$$g(x) = \frac{x}{2} + \frac{1}{x}.$$

Can we show that the fixed point iteration $x_{k+1} = g(x_k)$ converges for any
starting point $x_0 \in [1, 2]$? We will use Proposition 2.2, Proposition 2.3, and
Theorem 2.3 to show convergence. In particular, $g'(x) = 1/2 - 1/x^2$. Evaluating
$g'(x)$ over $[1, 2]$ with interval arithmetic, we obtain

$$g'([1, 2]) = \frac{1}{2} - \frac{1}{[1, 2]^2} = \frac{1}{2} - \frac{1}{[1, 4]} = \frac{1}{2} - \left[\tfrac{1}{4}, 1\right] = \left[-\tfrac{1}{2}, \tfrac{1}{4}\right].$$

Thus, since $g'(x) \in g'([1, 2]) \subseteq [-\frac{1}{2}, \frac{1}{4}]$ for every $x \in [1, 2]$,

$$|g'(x)| \le \max_{y \in [-\frac{1}{2}, \frac{1}{4}]} |y| = \frac{1}{2}$$

for every $x \in [1, 2]$. Thus, from Proposition 2.2, $g$ is a contraction on $[1, 2]$.
Furthermore, letting $\rho = 1/2$ and $c = 3/2$, $|g(3/2) - 3/2| = 1/12 \le 1/4 = (1 - L)\rho$. Thus,
by Proposition 2.3, $g$ maps $[1, 2]$ into $[1, 2]$. Therefore, we can conclude from
Theorem 2.3 that the fixed point iteration converges for any starting point
$x_0 \in [1, 2]$ to the unique fixed point $z = g(z)$.
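
As a numerical check of the error estimates (2.2) and (2.3) for this example, the following sketch compares the true error with both bounds. It assumes the starting point $x_0 = 1$ (chosen here for illustration) and uses the fact that the fixed point of $g(x) = x/2 + 1/x$ satisfies $z^2 = 2$.

```python
import math

def g(x):
    return x / 2 + 1 / x

L = 0.5                      # Lipschitz constant found in Example 2.5
z = math.sqrt(2.0)           # z = z/2 + 1/z  is equivalent to  z**2 = 2
x_prev, x = 1.0, g(1.0)      # x_0 = 1 (assumed start), x_1 = g(x_0)
first_step = abs(x - x_prev)
rows = []
for k in range(1, 6):
    a_priori = L**k / (1 - L) * first_step          # estimate (2.2)
    a_posteriori = L / (1 - L) * abs(x - x_prev)    # estimate (2.3)
    rows.append((k, x, abs(x - z), a_priori, a_posteriori))
    x_prev, x = x, g(x)

for row in rows:
    print(row)
```

The a posteriori bound (2.3) is typically much sharper than the a priori bound (2.2), since it uses the most recent step rather than the first one.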


Example 2.6
Consider the fixed point equation

$$g(x) = b + \frac{a}{x}, \quad \text{with } a, b > 0.$$

Can we find $z$ such that $z = g(z)$ and the fixed-point iterates converge to $z$?
The fixed-point iteration

$$x_{k+1} = b + \frac{a}{x_k}$$

for this example gives

$$x_1 = b + a/x_0,$$
$$x_2 = b + a/x_1 = b + \frac{a}{b + \frac{a}{x_0}},$$
$$x_3 = b + a/x_2 = b + \frac{a}{b + \frac{a}{b + \frac{a}{x_0}}}.$$

Hence, we have the question whether this continued fraction expansion is
converging to $z$, where $z = g(z)$. (Note that if $g(z) = z$, then $b + \frac{a}{z} = z$,
$bz + a = z^2$, so

$$z = \frac{b + \sqrt{b^2 + 4a}}{2}$$

is the fixed point.) To apply the Contraction Mapping Theorem (Theorem 2.3
on page 40), we need to

(a) determine a set $G$ on which $g$ is a contraction, and

(b) make sure that $G$ is such that $g: G \to G$.

For part (a), suppose that $G = [c - \rho, c + \rho]$. We have $g'(x) = -a/x^2$, so
$|g'(x)| < 1$ for $x > \sqrt{a}$. Assume that $c, \rho > 0$. If $c - \rho > \sqrt{a}$, then $|g'(x)| < 1$ on $G$.
Let

$$\rho = \frac{c - \sqrt{a}}{2} \quad \text{and} \quad c = \frac{b + \sqrt{b^2 + 4a}}{2}.$$

Then $c - \rho > \sqrt{a}$, so $g$ is a contraction on $G$.

For part (b), we need to show that $g$ maps $[c - \rho, c + \rho]$ into itself. By
Proposition 2.3, if $|g(c) - c| \le (1 - L)\rho$ then $g: G \to G$. However, since
$g(c) = c$, we need $0 \le (1 - L)\rho$. But $\rho = \frac{c - \sqrt{a}}{2} > 0$ and $0 \le L < 1$, so
$0 \le (1 - L)\rho$.

For example, let $a = b = 1$. Then

$$c = \frac{1 + \sqrt{5}}{2} \approx 1.618, \quad \rho = \frac{c - 1}{2} \approx 0.309,$$

and we may take

$$G = [c - \rho, c + \rho] = [1.31, 1.93].$$

Let

$$x_0 = 1, \quad x_1 = 1 + \frac{1}{1} = 2, \quad x_2 = 1 + \frac{1}{1 + \frac{1}{1}} = 1.5 \in G, \ldots$$

Thus, $x_3, x_4, x_5, \ldots$ will be in $G$ and will converge to $\frac{1 + \sqrt{5}}{2}$. That is,

$$1 + \cfrac{1}{1 + \cfrac{1}{1 + \cdots}} = \frac{1 + \sqrt{5}}{2}.$$


Note that $G \ne \mathbb{R}$ for this example. Consider $a = b = 1$, so $x_{k+1} = 1 + \frac{1}{x_k}$,
and consider $x_0 = -\frac{1}{2}$, $x_1 = -1$. Then $x_2 = 0$ and $x_3$ is undefined.
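
The continued-fraction iteration of this example is easy to run. In the sketch below, the function name and the choice of 30 iterations are assumptions made for illustration; the values $a = b = 1$ and $x_0 = 1$ are those used in the example.

```python
import math

def continued_fraction_iterates(a, b, x0, n):
    """Iterates x_0, ..., x_n of x_{k+1} = b + a / x_k."""
    xs = [x0]
    for _ in range(n):
        xs.append(b + a / xs[-1])
    return xs

a = b = 1.0
z = (b + math.sqrt(b * b + 4 * a)) / 2     # fixed point (1 + sqrt(5)) / 2
xs = continued_fraction_iterates(a, b, 1.0, 30)
print(xs[:4], xs[-1], z)
```

From $x_2$ on, every iterate lies in $G = [1.31, 1.93]$, as the example predicts.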


Example 2.7
Let $g(x) = 4 + \frac{1}{3}\sin 2x$ and $x_{k+1} = 4 + \frac{1}{3}\sin 2x_k$. Observing that

$$|g'(x)| = \left|\tfrac{2}{3}\cos 2x\right| \le \tfrac{2}{3}$$

for all $x$ shows that $g$ is a contraction on all of $\mathbb{R}$, so we can take $G = \mathbb{R}$. Then
$g: G \to G$ and $g$ is a contraction on $\mathbb{R}$. Thus, for any $x_0 \in \mathbb{R}$, the iterations
$x_{k+1} = g(x_k)$ will converge to $z$, where $z = 4 + \frac{1}{3}\sin 2z$. For $x_0 = 4$, the
following values are obtained.
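
Iterates for this example can be generated with the short sketch below. The number of iterations (50) is an assumption chosen so that the contraction bound with $L = 2/3$ already guarantees a small residual; no particular printed digits are claimed here.

```python
import math

def g(x):
    return 4 + math.sin(2 * x) / 3

x = 4.0                      # x_0 = 4, as in Example 2.7
history = [x]
for _ in range(50):
    x = g(x)
    history.append(x)

for k in range(6):
    print(k, history[k])
print("residual |x - g(x)| after 50 steps:", abs(x - g(x)))
```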


46 Classical and Modern Numerical Analysis 


REMARK 2.6 Suppose that $g$ satisfies the conditions of Theorem 2.3 (the
Contraction Mapping Theorem) for $G = [a, b]$. Suppose also that $g(x) \le g(y)$
for $a \le x < y \le b$, that is, $g$ is a monotonically increasing function. (For
example, $g$ is monotonically increasing if $0 < g'(x) < 1$ for $x \in [a, b]$.) Let
$x_0 = a$, so $x_1 = g(x_0) \ge x_0$. (We have $x_1 \ge a = x_0$ because $x_1 \in [a, b]$.) In
fact, $x_2 = g(x_1) \ge g(x_0) = x_1$ since $x_1 \ge x_0$. Thus, $a = x_0 \le x_1 \le x_2 \le \cdots$
and $x_k \to z$ monotonically as $k \to \infty$. Figure 2.4 illustrates this geometrically.
A similar result holds if $g$ is monotonically decreasing on $[a, b]$.


FIGURE 2.4: Example of Remark 2.6 (monotonic convergence of fixed 
point iteration). 


We now consider an interesting result on the order of convergence for fixed-point
iteration. Recall that if $\lim_{k \to \infty} x_k = z$ and $|x_{k+1} - z| \le c|x_k - z|^{\alpha}$, we say $\{x_k\}$
converges to $z$ with rate of convergence $\alpha$. (We specify that $c < 1$ for $\alpha = 1$.)


THEOREM 2.4

Assume that the iterations $x_{k+1} = g(x_k)$ converge to a fixed point $z$. Furthermore,
assume that $q$ is the first positive integer for which $g^{(q)}(z) \ne 0$ and
if $q = 1$ then $|g'(z)| < 1$. Then the sequence $\{x_k\}$ converges to $z$ with order
$q$. (It is assumed that $g \in C^q(G)$, where $G$ contains $z$.)


REMARK 2.7 Generally $q = 1$, so for many fixed-point iterations the
order of convergence is linear.


PROOF We have

$$|x_{k+1} - z| = |g(x_k) - z| = \left| g(z) + g'(z)(x_k - z) + \cdots + g^{(q)}(\xi_k)\frac{(x_k - z)^q}{q!} - z \right|$$

(where $\xi_k$ is between $x_k$ and $z$)

$$= \left| \frac{g^{(q)}(\xi_k)}{q!}(x_k - z)^q \right|.$$

Thus,

$$|x_{k+1} - z| \le \frac{|g^{(q)}(\xi_k)|}{q!}|x_k - z|^q \le c|x_k - z|^q,$$

where

$$c = \max_{x \in G} \frac{|g^{(q)}(x)|}{q!}.$$
Example 2.8
Let

$$g(x) = \frac{x^2 + 6}{5}$$

and $G = [1, 2.3]$. Notice that $g: G \to G$ and

$$|g'(x)| = \frac{2x}{5} < 1$$

for $x \in G$. Then by Theorem 2.3 there is a unique fixed point $z \in G$. It is
easy to see that the fixed point is $z = 2$. In addition, since $g'(z) = \frac{4}{5} \ne 0$,
there is a linear rate of convergence. Inspecting the values in the following
table, notice that the convergence is not fast.


Example 2.9
Let

$$g(x) = \frac{x}{2} + \frac{2}{x},$$

which has the same form as the function in Example 2.5. It can be shown that if $0 < x_0 < 2$, then $x_1 > 2$.
Also, $x_k > x_{k+1} > 2$ when $x_k > 2$. Thus, $\{x_k\}$ is a monotonically decreasing
sequence bounded below by 2 and hence is convergent. Thus, for any $x_0 \in (0, \infty)$,
the sequence $x_{k+1} = g(x_k)$ converges to $z = 2$.
Now consider the convergence rate. We have that

$$g'(x) = \frac{1}{2} - \frac{2}{x^2},$$

so $g'(2) = 0$, and

$$g''(x) = \frac{4}{x^3},$$

so $g''(2) \ne 0$. By Theorem 2.4, the convergence is quadratic, and as indicated
in the following table, the convergence is rapid.

k | x_k
0 |
1 |
2 | 2.00002
3 | 2.00000000
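
The contrast between the linear convergence of Example 2.8 and the quadratic convergence of Example 2.9 can be observed numerically by forming the ratios $|x_{k+1} - z| / |x_k - z|^q$ from Theorem 2.4. In this sketch the starting points $x_0 = 1$ and $x_0 = 2.2$ are assumptions chosen for illustration, not values taken from the text.

```python
def iterate(g, x0, n):
    """Return [x_0, ..., x_n] with x_{k+1} = g(x_k)."""
    xs = [x0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs

z = 2.0
lin = iterate(lambda x: (x * x + 6) / 5, 1.0, 8)     # Example 2.8, assumed x_0 = 1
quad = iterate(lambda x: x / 2 + 2 / x, 2.2, 3)      # Example 2.9, assumed x_0 = 2.2

# Ratios |x_{k+1} - z| / |x_k - z|**q; they should approach |g^(q)(z)| / q!.
lin_ratio = abs(lin[8] - z) / abs(lin[7] - z)            # q = 1, limit g'(2) = 0.8
quad_ratio = abs(quad[2] - z) / abs(quad[1] - z) ** 2    # q = 2, limit g''(2)/2 = 0.25
print(lin_ratio, quad_ratio)
```

With these starting points the linear iteration still has an error of roughly 0.08 after eight steps, while the quadratic one is within about $2 \times 10^{-5}$ of $z$ after two.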
Example 2.10
Let

$$g(x) = \frac{3}{8}x^4 - 4.$$

There is a unique fixed point $z = 2$. However, $g'(x) = \frac{3}{2}x^3$, so $g'(2) = 12$,
and we cannot conclude linear convergence. Indeed, the fixed point iterations
converge only if $x_0 = 2$. If $x_0 > 2$, then $x_1 > x_0 > 2$, $x_2 > x_1 > x_0 > 2, \ldots$.
Similarly, if $x_0 < 2$, it can be verified that, for some $k$, $x_k < 0$, after which
$x_{k+1} > 2$, and we are in the same situation as if $x_0 > 2$. That is, fixed point
iterations diverge unless $x_0 = 2$.


Example 2.11 
Consider the method given by $x_{k+1} = g(x_k)$ with


sea Fo es FT 


It is straightforward to show that if $f(z) = 0$, then $g'(z) = g''(z) = 0$ but
$g'''(z) \ne 0$ (generally). Thus, when the method is convergent, the method has
a cubic convergence rate.


f(z) — £'(@) 


2.4 Newton's Method (Newton-Raphson Method)


We now return to the problem: given $f(x)$, find $z$ such that $f(z) = 0$.
Newton's iteration for finding approximate solutions to this problem has the
form

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} \quad \text{for } k = 0, 1, 2, \ldots \tag{2.6}$$


REMARK 2.8 Newton's method is a special fixed-point method with
$g(x) = x - f(x)/f'(x)$.


Figure 2.5 illustrates the geometric interpretation of Newton's method. To
find $x_{k+1}$, the tangent line to the curve at point $(x_k, f(x_k))$ is followed to
the $x$-axis. The tangent line is $y - f(x_k) = f'(x_k)(x - x_k)$. Thus, at $y = 0$,
$x = x_k - f(x_k)/f'(x_k) = x_{k+1}$.


FIGURE 2.5: Illustration of two iterations of Newton's method.
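
A minimal Python sketch of iteration (2.6) follows. The stopping rule (stop when the step $|x_{k+1} - x_k|$ falls below `tol`), the iteration cap `max_iter`, and the function name are assumptions introduced here; the text itself does not prescribe a stopping criterion.

```python
import math

def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton iteration (2.6); stops when the step |x_{k+1} - x_k| <= tol."""
    x = x0
    for _ in range(max_iter):
        dfx = df(x)
        if dfx == 0.0:
            raise ZeroDivisionError("f'(x_k) = 0; the Newton step is undefined")
        x_new = x - f(x) / dfx         # follow the tangent line to the x-axis
        if abs(x_new - x) <= tol:
            return x_new
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")

# f(x) = x + exp(x), the function used in Example 2.1, starting from x_0 = -1.
root = newton(lambda x: x + math.exp(x), lambda x: 1 + math.exp(x), -1.0)
print(root)
```

Raising an error when the cap is reached, rather than returning the last iterate, makes divergence visible to the caller.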


REMARK 2.9 We will see that Newton's method is quadratically convergent,
and is therefore fast when compared to a typical linearly convergent
fixed point method. However, Newton's method may diverge if $x_0$ is not
sufficiently close to a root $z$ at which $f(z) = 0$. To see this, study Figure 2.6.


FIGURE 2.6: Example for Remark 2.9 (divergence of Newton's method).
On the left, the sequence diverges; on the right, the sequence oscillates.


Another conceptually useful way of deriving Newton's method is using Taylor's
formula. We have

$$0 = f(z) = f(x_k) + f'(x_k)(z - x_k) + \frac{(z - x_k)^2}{2} f''(\xi_k),$$

where $\xi_k$ is between $z$ and $x_k$. Thus, assuming that $(z - x_k)^2$ is small,

$$z \approx x_k - \frac{f(x_k)}{f'(x_k)}.$$

Hence, when $x_{k+1} = x_k - f(x_k)/f'(x_k)$, we would expect $x_{k+1}$ to be closer to
$z$ than $x_k$.


We now study convergence of Newton’s method. We make use of the fol- 
lowing Lemma. 


LEMMA 2.1

Let $G$ be a subset of $\mathbb{R}$. Assume that $f \in C^2(G)$. Then for $x, y \in G$,

$$f(y) = f(x) + f'(x)(y - x) + R(y, x),$$

where

$$R(y, x) = (y - x)^2 \int_0^1 (1 - s) f''(x + s(y - x))\,ds.$$


PROOF By Taylor's formula,

$$f(y) = f(x) + f'(x)(y - x) + \int_x^y (y - t) f''(t)\,dt.$$

Let $s = (t - x)/(y - x)$, $t = s(y - x) + x$, $dt = ds\,(y - x)$. Then

$$f(y) = f(x) + f'(x)(y - x) + \int_0^1 (y - x)^2 (1 - s) f''(x + s(y - x))\,ds.$$


THEOREM 2.5
(Convergence of Newton's method; existence and uniqueness of a root.) Let
$G \subseteq \mathbb{R}$. Assume that $f \in C^2(G)$ and $f$ satisfies

$$|f'(x)| \ge m, \quad |f''(x)| \le M$$

for $x \in G$, where $m, M > 0$. Then for each zero $z \in G$, there is a neighborhood
$K_\rho(z) = [z - \rho, z + \rho] \subseteq G$ such that $z$ is the only zero of $f$ in $K_\rho(z)$, and for
each $x_0 \in K_\rho(z)$, the approximations $x_1, x_2, \ldots$ remain in $K_\rho(z)$ and converge
to $z$ with error estimates:

(a) $|x_k - z| \le \frac{2m}{M} q^{2^k}$, where $q = \frac{M}{2m}|x_0 - z|$,

(b) $|x_k - z| \le \frac{1}{m}|f(x_k)| \le \frac{M}{2m}|x_k - x_{k-1}|^2$,

(c) $|x_{k+1} - z| \le \frac{M}{2m}|x_k - z|^2$ (quadratic convergence)

for $k = 1, 2, \ldots$. Also, $\rho$ can be selected to be any number less than $2m/M$
provided that $K_\rho(z) \subseteq G$.


Theorem 2.5 is a one-dimensional analog of the Newton—Kantorovich theo- 
rem, which we will see in §8.3.5 on page 454. 


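PROOF By the Mean Value Theorem,

$$\frac{|f(x) - f(\tilde{x})|}{|x - \tilde{x}|} = |f'(\xi)| \ge m$$

for some $\xi \in G$, whenever $x \in K_\rho(z)$ and $\tilde{x} \in K_\rho(z) \subseteq G$. Thus,

$$m|x - \tilde{x}| \le |f(x) - f(\tilde{x})| \quad \text{for } x, \tilde{x} \in K_\rho(z) \subseteq G. \tag{2.7}$$

By Lemma 2.1,

$$f'(x)(x - y) = f(x) - f(y) + (y - x)^2 \int_0^1 (1 - s) f''(x + s(y - x))\,ds.$$

Thus,

$$|f'(x)||x - y| \le |f(x) - f(y)| + (y - x)^2 \frac{M}{2}. \tag{2.8}$$

Now choose $\rho < 2m/M$. Then there is a unique zero in $K_\rho(z)$. To see this,
let $z_1 \ne z$ be another zero. Then by (2.8),

$$|f'(z)||z - z_1| \le \frac{1}{2} M (z - z_1)^2.$$

Thus, $|z_1 - z| \ge 2m/M > \rho$, which is not possible since $K_\rho(z) = [z - \rho, z + \rho]$.
Now consider $g(x) = x - f(x)/f'(x)$ for $x \in K_\rho(z)$. Then

$$g(x) - z = -\frac{1}{f'(x)}\left[ f(x) + (z - x) f'(x) \right] = \frac{1}{f'(x)} R(z, x)$$

by Lemma 2.1. Thus,

$$|g(x) - z| \le \frac{|R(z, x)|}{|f'(x)|} \le \frac{1}{m}\frac{M}{2}|z - x|^2 \le \rho^2 \frac{M}{2m} < \rho. \tag{2.9}$$

Hence, $g(x) \in [z - \rho, z + \rho] = K_\rho(z)$ for $x \in K_\rho(z)$. Thus, $g$ maps $K_\rho(z)$ into
$K_\rho(z)$. Let $x_0 \in K_\rho(z)$, $x_{k+1} = g(x_k)$, $k = 0, 1, 2, \ldots$. Thus, the approximations
remain in $K_\rho(z)$. Also, by (2.9) (letting $x = x_k$),

$$|x_{k+1} - z| \le \frac{M}{2m}|x_k - z|^2$$

for $k = 0, 1, 2, \ldots$, proving inequality (c). This inequality has the form

$$\frac{M}{2m}|x_{k+1} - z| \le \left( \frac{M}{2m}|x_k - z| \right)^2.$$

Now let $\rho_k = \frac{M}{2m}|x_k - z|$. Then

$$\rho_{k+1} \le \rho_k^2 \le (\rho_{k-1}^2)^2 \le \cdots, \quad \text{so } \rho_{k+1} \le \rho_0^{2^{k+1}}.$$

Thus,

$$\frac{M}{2m}|x_k - z| \le \left( \frac{M}{2m}|x_0 - z| \right)^{2^k}.$$

Hence,

$$|x_k - z| \le \frac{2m}{M} q^{2^k}, \quad \text{where } q = \frac{M}{2m}|x_0 - z|.$$

Inequality (a) is thus proven. (Notice that since $\rho < \frac{2m}{M}$, we have $q < 1$, so the above inequality
proves convergence for $x_0 \in K_\rho(z)$.)
Finally, to show (b), we have by (2.7) that

$$|x_k - z| \le \frac{1}{m}|f(x_k) - f(z)| = \frac{1}{m}|f(x_k)|$$
$$= \frac{1}{m}|f(x_k) - f(x_{k-1}) - f'(x_{k-1})(x_k - x_{k-1})|$$
$$= \frac{1}{m}|R(x_k, x_{k-1})| \quad \left(\text{since } x_k = x_{k-1} - \frac{f(x_{k-1})}{f'(x_{k-1})}\right)$$
$$\le \frac{M}{2m}|x_k - x_{k-1}|^2.$$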


REMARK 2.10 For $f(z) = 0$, $f'(z) \ne 0$, and $f \in C^2(G)$, there is some
$K_\rho(z) \subseteq G$ such that Newton's iteration converges for $x_0 \in K_\rho(z)$. However,
$\rho$ may be quite small as indicated by Figure 2.6.


REMARK 2.11 The quadratic convergence rate of Newton's method
could have been inferred from Theorem 2.4 by analyzing Newton's method as
a fixed point iteration. Consider

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = g(x_k).$$

Observe that $g(z) = z$,

$$g'(z) = 0 = 1 - \frac{f'(z)}{f'(z)} + \frac{f(z) f''(z)}{(f'(z))^2},$$

and, usually, $g''(z) \ne 0$. Thus, the quadratic convergence follows from Theorem 2.4.


REMARK 2.12 By estimate (c), Newton's method has a quadratic convergence
rate, i.e.,

$$|x_{k+1} - z| \le \frac{M}{2m}|x_k - z|^2,$$

which is generally very fast. Consider estimate (a). If $q = 0.9$ and $k = 10$,
then

$$|x_k - z| \le \frac{2m}{M}(0.9)^{2^{10}} = \frac{2m}{M}(0.9)^{1024} = \frac{2m}{M} \times 1.39 \times 10^{-47}.$$


Example 2.12
Let $f(x) = x + e^x$. Compare bisection, simple fixed-point iteration, and
Newton's method.

- Newton's method: $x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = x_k - \frac{x_k + e^{x_k}}{1 + e^{x_k}} = \frac{(x_k - 1)e^{x_k}}{1 + e^{x_k}}$.
- Fixed-Point (one form): $x_{k+1} = -e^{x_k} = g(x_k)$.

k | x_k (Bisection, a = -1, b = 0) | x_k (Fixed-Point) | x_k (Newton's)
0 | -0.5 | -1.0 | -1.0
1 | -0.75 | -0.367879 | -0.537883
2 | -0.625 | -0.692201 | -0.566987
3 | -0.5625 | -0.500474 | -0.567143
4 | -0.59375 | -0.606244 | -0.567143
5 | -0.578125 | -0.545396 | -0.567143
10 | -0.566895 | -0.568429 | -0.567143
20 | -0.567143 | -0.567148 | -0.567143
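
The comparison in this example can be regenerated with the sketch below. The helper names are assumptions made here; the bisection column follows Algorithm 2.1 on $[-1, 0]$, and the other two columns start from $x_0 = -1$ as in the table.

```python
import math

def f(x):
    return x + math.exp(x)

def bisection_mid(k, a=-1.0, b=0.0):
    """Midpoint x_k of Algorithm 2.1 after k interval updates."""
    for _ in range(k):
        x = (a + b) / 2
        if f(x) * f(a) > 0:
            a = x
        else:
            b = x
    return (a + b) / 2

def iterate(g, x0, k):
    """k-th iterate of x_{j+1} = g(x_j)."""
    x = x0
    for _ in range(k):
        x = g(x)
    return x

def fixed(x):                # fixed-point form x = -exp(x)
    return -math.exp(x)

def newton_step(x):          # simplified Newton step from the example
    return (x - 1) * math.exp(x) / (1 + math.exp(x))

for k in (0, 1, 2, 3, 4, 5, 10, 20):
    print(k, round(bisection_mid(k), 6),
          round(iterate(fixed, -1.0, k), 6),
          round(iterate(newton_step, -1.0, k), 6))
```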


2.5 The Univariate Interval Newton Method


A simple application of the ideas behind Newton's method and the Mean
Value Theorem leads to a mathematically rigorous computation of the zeros
of a function $f$. In particular, suppose $\boldsymbol{x} = [\underline{x}, \overline{x}]$ is an interval, and suppose
that there is a $z \in \boldsymbol{x}$ with $f(z) = 0$. Let $\check{x}$ be any point (such as the midpoint
of $\boldsymbol{x}$). Then the Mean Value Theorem (page 3) gives

$$0 = f(\check{x}) + f'(\xi)(z - \check{x}). \tag{2.10}$$

Solving (2.10) for $z$, then applying the fundamental theorem of interval arithmetic
(page 26) gives

$$z = \check{x} - \frac{f(\check{x})}{f'(\xi)} \in \check{x} - \frac{f(\check{x})}{\boldsymbol{f}'(\boldsymbol{x})} = N(f; \boldsymbol{x}, \check{x}). \tag{2.11}$$

Thus, any solution $z \in \boldsymbol{x}$ of $f(x) = 0$ must also be in $N(f; \boldsymbol{x}, \check{x})$.


DEFINITION 2.4 We call $N(f; \boldsymbol{x}, \check{x})$ the univariate interval Newton
operator.


The interval Newton operator forms the basis of a fixed-point type of iteration
of the form

$$\boldsymbol{x}_{k+1} \leftarrow N(f; \boldsymbol{x}_k, \check{x}_k) \quad \text{for } k = 1, 2, \ldots$$


2.5.1 Existence and Uniqueness Verification 


The interval Newton method is similar in many ways to the traditional 
Newton—Raphson method of Section 2.4 (page 49), but provides a way to use 
floating point arithmetic (with upward and downward roundings) to rigorously 
prove existence and uniqueness of solutions, as well as to provide rigorous 
upper and lower bounds on exact solutions. We now discuss existence and 
uniqueness properties of the interval Newton method. The first result is 


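THEOREM 2.6
Suppose $f \in C(\boldsymbol{x}) = C([\underline{x}, \overline{x}])$, $\check{x} \in \boldsymbol{x}$, and $N(f; \boldsymbol{x}, \check{x}) \subseteq \boldsymbol{x}$. Then there is an
$x^* \in \boldsymbol{x}$ such that $f(x^*) = 0$. Furthermore, this $x^*$ is unique.


PROOF Assume that $N(f; \boldsymbol{x}, \check{x}) \subseteq \boldsymbol{x}$. First, observe that $0 \notin \boldsymbol{f}'(\boldsymbol{x})$,
since $\boldsymbol{f}'(\boldsymbol{x})$ is in the denominator, and the only way $N(f; \boldsymbol{x}, \check{x})$ can be a single
bounded interval is if the denominator does not contain zero. (See the remark
on extended interval arithmetic on page 24.) Now, apply the Mean Value
Theorem to obtain

$$f(\underline{x}) = f(\check{x}) + f'(\underline{\xi})(\underline{x} - \check{x}), \tag{2.12}$$

and to obtain

$$f(\overline{x}) = f(\check{x}) + f'(\overline{\xi})(\overline{x} - \check{x}), \tag{2.13}$$

for some $\underline{\xi} \in \boldsymbol{x}$ and some $\overline{\xi} \in \boldsymbol{x}$. Solving (2.12) for $f(\underline{x})/f'(\underline{\xi})$ and solving
(2.13) for $f(\overline{x})/f'(\overline{\xi})$ gives

$$\frac{f(\underline{x})}{f'(\underline{\xi})} = \underline{x} - \left( \check{x} - \frac{f(\check{x})}{f'(\underline{\xi})} \right) \le 0, \tag{2.14}$$

and

$$\frac{f(\overline{x})}{f'(\overline{\xi})} = \overline{x} - \left( \check{x} - \frac{f(\check{x})}{f'(\overline{\xi})} \right) \ge 0, \tag{2.15}$$

where the inequalities hold because the quantities in parentheses lie in $N(f; \boldsymbol{x}, \check{x}) \subseteq \boldsymbol{x}$.
Since $0 \notin \boldsymbol{f}'(\boldsymbol{x})$, $f'(\underline{\xi})$ and $f'(\overline{\xi})$ have the same sign, so (2.14) and (2.15) imply
that either

$$f(\underline{x}) \le 0 \text{ and } f(\overline{x}) \ge 0,$$

or

$$f(\underline{x}) \ge 0 \text{ and } f(\overline{x}) \le 0.$$

Therefore, by the Intermediate Value Theorem (Theorem 1.1 on page 1), there
is an $x^* \in \boldsymbol{x}$ such that $f(x^*) = 0$.
To prove uniqueness of $x^*$, assume there is an $\tilde{x} \in \boldsymbol{x}$ with $f(\tilde{x}) = 0$ and
$\tilde{x} \ne x^*$. Then, by the Mean Value Theorem,

$$0 = f(\tilde{x}) - f(x^*) = f'(\eta)(\tilde{x} - x^*),$$

for some $\eta \in \boldsymbol{x}$, $\eta$ between $x^*$ and $\tilde{x}$. Therefore, since $\tilde{x} - x^* \ne 0$, $f'(\eta) = 0 \in \boldsymbol{f}'(\boldsymbol{x})$,
contradicting our first deduction. Thus, $x^*$ must be unique.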


2.5.2 Interval Newton Algorithm 


Now, let’s study a formal algorithm for the interval Newton method. 


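ALGORITHM 2.2

(The univariate interval Newton method)
INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$, $f: \boldsymbol{x} \subset \mathbb{R} \to \mathbb{R}$, a maximum number of iterations $N$, and
a stopping tolerance $\epsilon$.
OUTPUT: Either

1. "solution does not exist within the original $\boldsymbol{x}$", or

2. a new interval $\boldsymbol{x}^*$ such that any $x^* \in \boldsymbol{x}$ with $f(x^*) = 0$ has $x^* \in \boldsymbol{x}^*$,
and one of:
   (a) "existence and uniqueness verified" and "tolerance met,"
   (b) "existence and uniqueness not verified,"
   (c) "solution does not exist," or
   (d) "existence and uniqueness verified" but "tolerance not met."

1. $k \leftarrow 1$.

2. "existence and uniqueness verified" $\leftarrow$ "false."

3. "solution does not exist" $\leftarrow$ "false."

4. DO WHILE $k \le N$.
   (a) $\check{x} \leftarrow (\underline{x} + \overline{x})/2$.
   (b) IF $\check{x} \notin \boldsymbol{x}$ THEN RETURN.
   (c) $\tilde{\boldsymbol{x}} \leftarrow N(f; \boldsymbol{x}, \check{x})$.
   (d) IF $\tilde{\boldsymbol{x}} \subseteq \boldsymbol{x}$ (that is, if $\underline{\tilde{x}} \ge \underline{x}$ and $\overline{\tilde{x}} \le \overline{x}$) THEN
       "existence and uniqueness verified" $\leftarrow$ "true."
   (e) IF $\tilde{\boldsymbol{x}} \cap \boldsymbol{x} = \emptyset$ (that is, if $\overline{\tilde{x}} < \underline{x}$ or $\underline{\tilde{x}} > \overline{x}$) THEN
       i. "solution does not exist" $\leftarrow$ "true."
       ii. RETURN.
       END IF
   (f) IF $w(\tilde{\boldsymbol{x}}) < \epsilon$ THEN
       i. "tolerance met" $\leftarrow$ "true."
       ii. RETURN.
       END IF
   (g) $\boldsymbol{x} \leftarrow \boldsymbol{x} \cap \tilde{\boldsymbol{x}}$.
   (h) $k \leftarrow k + 1$.
   END DO

5. "tolerance met" $\leftarrow$ "false."

6. RETURN.
END ALGORITHM 2.2.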


Notes:


1. The interval Newton method generally becomes stationary. (That is, the
end points of $\boldsymbol{x}$ can be proven to not change, under certain assumptions
on the machine arithmetic.) However, it is good general programming
practice to enforce an upper limit on the total number of iterations of
any iterative process, to avoid problems arising from slow convergence,
etc.

2. In Step 4a of Algorithm 2.2, the midpoint is computed approximately,
and it occasionally occurs (when the interval is very narrow) that the
machine approximation lies outside the interval. Thus, we need to check
for this possibility.

3. Although $f$ is evaluated at a point in the expression

$$\check{x} - \frac{f(\check{x})}{\boldsymbol{f}'(\boldsymbol{x})}$$

for $N(f; \boldsymbol{x}, \check{x})$, the machine must evaluate $f$ with interval arithmetic to
take account of rounding error. (That is, we start the computations
with the degenerate interval $[\check{x}, \check{x}]$.) Otherwise, the results are not
mathematically rigorous.
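
The structure of the interval Newton iteration can be sketched in Python as follows. This sketch is deliberately not rigorous: it uses ordinary floating point without the directed (outward) rounding that Note 3 requires, it does not handle a derivative enclosure containing zero, and the function name, the argument `dF` (a user-supplied routine returning bounds on $f'$ over an interval), and the return convention are all assumptions introduced here.

```python
def interval_newton(f, dF, lo, hi, eps=1e-12, max_iter=20):
    """Non-rigorous sketch of the interval Newton iteration on [lo, hi].

    f  : point function; dF(lo, hi) returns bounds (dlo, dhi) on f' over [lo, hi].
    Returns (lo, hi, verified), where verified means N(f; x, mid) was inside x.
    """
    verified = False
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        dlo, dhi = dF(lo, hi)
        if dlo <= 0.0 <= dhi:
            break                       # extended interval division not handled here
        fm = f(mid)
        q = (fm / dlo, fm / dhi)        # f(mid)/f'(x) lies between these two quotients
        n_lo, n_hi = mid - max(q), mid - min(q)
        if lo <= n_lo and n_hi <= hi:
            verified = True             # N(f; x, mid) is contained in x
        if n_hi < lo or n_lo > hi:
            raise ValueError("solution does not exist")
        new_lo, new_hi = max(lo, n_lo), min(hi, n_hi)   # x <- x intersect N
        done = (new_lo, new_hi) == (lo, hi) or new_hi - new_lo < eps
        lo, hi = new_lo, new_hi
        if done:
            break
    return lo, hi, verified

# f(x) = x**2 - 2 on [1, 2]; f'(x) = 2x is enclosed by [2*lo, 2*hi] when lo > 0.
lo, hi, ok = interval_newton(lambda x: x * x - 2, lambda a, b: (2 * a, 2 * b), 1.0, 2.0)
print(lo, hi, ok)
```

For this input the first pass gives $N = [1.375, 1.4375]$, which lies inside $[1, 2]$, so the containment test of Theorem 2.6 succeeds immediately.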


2.5.3 Convergence Rate of the Interval Newton Method 


Similar to the traditional Newton—Raphson method, the interval Newton 
method exhibits quadratic convergence. (This is common knowledge.) An 
example of a specific theorem along these lines is 


THEOREM 2.7

(Quadratic convergence of the interval Newton method) Suppose $f: \boldsymbol{x} \to \mathbb{R}$,
suppose $f \in C(\boldsymbol{x})$ and $f' \in C(\boldsymbol{x})$, and suppose there is an $x^* \in \boldsymbol{x}$ such that
$f(x^*) = 0$. Suppose further that $\boldsymbol{f}'$ is a first order or higher order interval
extension of $f'$ in the sense of Theorem 1.10 (on page 27). Then, for the initial
width $w(\boldsymbol{x})$ sufficiently small,

$$w(N(f; \boldsymbol{x}, \check{x})) = O(w(\boldsymbol{x})^2).$$


We will not give a proof of Theorem 2.7 here, although Theorem 2.7 is a 
special case of Theorem 6.3, page 222 in [44]. We will illustrate this quadratic 
convergence with 


Example 2.13

(Taken from [48].) Apply the interval Newton method $\boldsymbol{x} \leftarrow N(f; \boldsymbol{x}, \check{x})$,
$\check{x} \leftarrow (\underline{x} + \overline{x})/2$, to

$$f(x) = x^2 - 2,$$

starting with $\boldsymbol{x} = [1, 2]$ and $\check{x} = 1.5$. The results for Example 2.13 appear in
Table 2.1. Here,

$$\delta_k = \frac{w(\boldsymbol{x}_k)}{\max\{\max_{y \in \boldsymbol{x}_k}\{|y|\}, 1\}}$$

is a scaled version of the width $w(\boldsymbol{x}_k)$, and $\rho_k = \max_{y \in \boldsymbol{f}(\boldsymbol{x}_k)}\{|y|\}$. The displayed
decimal intervals have been rounded out from the corresponding binary
intervals.


TABLE 2.1: Convergence of the interval Newton method with $f(x) = x^2 - 2$.

k | x_k | δ_k | ρ_k
0 | [1.00000000000000, 2.00000000000000] | 5.00 × 10^-1 | 2.00 × 10^0
1 | [1.37499999999999, 1.43750000000001] | 4.35 × 10^-2 | 1.09 × 10^-1
2 | [1.41406249999999, 1.41441761363637] | 2.51 × 10^-4 | 5.77 × 10^-4
3 | [1.41421355929452, 1.41421356594718] | 4.70 × 10^-9 | 1.01 × 10^-8
4 | [1.41421356237309, 1.41421356237310] | 4.71 × 10^-16 | 1.33 × 10^-15
5 | [1.41421356237309, 1.41421356237310] | 4.71 × 10^-16 | 1.33 × 10^-15
6 | [1.41421356237309, 1.41421356237310] | 4.71 × 10^-16 | 1.33 × 10^-15


2.6 Secant Method and Müller's Method


Under certain circumstances, $f$ may have a continuous derivative, but it
may not be possible to explicitly compute it. This is less true now than in
the past, because techniques of automatic differentiation (or "computational
differentiation"), such as we explain in Section 6.2, page 327, have been developed,
have become more widely available, and are used in practice. However,
there are still various situations involving black box functions $f$. In "black
box" functions, $f$ is evaluated by some external procedure (such as a software
system provided by someone other than its user), in which one supplies the
input $x$, and the output $f(x)$ is returned, but the user (or the designer of the
method for finding points $x^*$ with $f(x^*) = 0$) does not have access to the internal
workings, so that $f'$ cannot be easily computed. In such cases, methods that
converge more rapidly than the method of bisection, but that do not require
evaluation of $f'$, are useful.


Example 2.14 
Suppose we wish to find a zero of 


f(z) =e* -—g (cos + “ sina +Ine) ‘ 
x 


where 
g(x) = Tie + 3x” + 52 + (5+ e* + cosz), 
2x? 
e€ 
h(z) = ———__— 
(x) (1 +4 x 4 x?) ’ 
and $a$ is a constant.


Problems as complicated as this are not uncommon. Prior to widespread
use of automatic differentiation, applying Newton's method to this problem
was quite difficult because it would have been difficult and time-consuming
to calculate $f'(x_k)$ at each iteration. Automatic differentiation is now an
option for many problems of this type. However, in certain situations, such
as applying the shooting method to solution of boundary-value problems (see
the discussion in Chapter 10), $f'$ cannot be directly computed and the secant
method is useful. In this section, we will assume that $f'$ cannot be computed,
and we will treat $f$ as a "black-box" function.


2.6.1 The Secant Method 


In the secant method, $f'(x_k)$ is approximated by

$$f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}.$$

The secant method thus has the form

$$x_{k+1} = x_k - f(x_k)\,\frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}. \tag{2.16}$$

If $f(x_k)$ and $f(x_{k+1})$ have opposite signs, then, as with the bisection method,
there must be an $x^*$ between $x_k$ and $x_{k+1}$ for which $f(x^*) = 0$.

For the secant method, we need starting values $x_0$ and $x_1$. However, only
one evaluation of the function $f$ is required at each iteration, since $f(x_{k-1})$ is
known from the previous iteration.

Geometrically (see Figure 2.7), to obtain $x_{k+1}$, the secant to the curve
through $(x_{k-1}, f(x_{k-1}))$ and $(x_k, f(x_k))$ is followed to the $x$-axis.


FIGURE 2.7: Geometric interpretation of the secant method.
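
A minimal Python sketch of the secant iteration follows. As with the Newton sketch, the stopping rule, the cap `max_iter`, and the function name are assumptions made here. The previous function value is carried along so that each pass costs one new evaluation of $f$.

```python
import math

def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant iteration; one new evaluation of f per step."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("f(x_k) = f(x_{k-1}); the secant step is undefined")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) <= tol:
            return x2
        x0, f0 = x1, f1                # reuse the stored value f(x_{k-1})
        x1, f1 = x2, f(x2)             # the only new evaluation of f
    raise RuntimeError("no convergence within max_iter iterations")

# f(x) = x + exp(x) with starting values x_0 = -1 and x_1 = 0.
root = secant(lambda x: x + math.exp(x), -1.0, 0.0)
print(root)
```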


REMARK 2.13 The secant method is not a fixed-point method, since
$x_{k+1} = g(x_k, x_{k-1})$.


We now consider convergence of the secant method.


THEOREM 2.8
(Convergence of the secant method) Let $G$ be a subset of $\mathbb{R}$ containing a zero
$z$ of $f(x)$. Assume $f \in C^2(G)$ and there exists an $M > 0$ such that

$$M = \frac{\max_{x \in G} |f''(x)|}{2 \min_{x \in G} |f'(x)|}.$$

Let $x_0$ and $x_1$ be two initial guesses to $z$ and let

$$K_\epsilon(z) = (z - \epsilon, z + \epsilon) \subseteq G,$$

where $\epsilon = \frac{\delta}{M}$ and $\delta < 1$. Let $x_0, x_1 \in K_\epsilon(z)$. Then, the iterates $x_2, x_3, x_4,
\ldots$ remain in $K_\epsilon(z)$ and converge to $z$ with error

$$|x_k - z| \le \frac{1}{M}\,\delta^{q_k},$$

where $\{q_k\}$ is the Fibonacci sequence $q_0 = q_1 = 1$, $q_{k+1} = q_k + q_{k-1}$, so that
$q_k$ grows like $\left(\frac{1 + \sqrt{5}}{2}\right)^k$.


REMARK 2.14 $(1 + \sqrt{5})/2 \approx 1.618$, a fractional order of convergence
between 1 and 2. For Newton's method, $|x_k - z| \le \frac{2m}{M} q^{2^k}$ with $q < 1$. Compare
this with the preceding bound.


PROOF (of Theorem 2.8) Subtracting $z$ from both sides of the secant
method and multiplying by $-1$,

$$z - x_{k+1} = z - x_k + f(x_k)\,\frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}.$$

This can be written as:

$$(z - x_{k+1}) = \bigl[-(z - x_{k-1})(z - x_k)\bigr]\,\frac{C_k}{D_k},$$

where (since $f(z) = 0$)

$$C_k = \frac{\dfrac{f(z) - f(x_k)}{z - x_k} - \dfrac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}}{z - x_{k-1}}
\quad\text{and}\quad
D_k = \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}.$$

Now, let $e_k = z - x_k$. Then from the above equation,

$$|e_{k+1}| = |e_{k-1}|\,|e_k|\,\frac{|C_k|}{|D_k|}.$$

However, the Mean Value Theorem gives

$$D_k = f'(\xi_k)$$

for some $\xi_k$ in the interval containing $x_{k-1}$ and $x_k$. Similarly,

$$C_k = f''(\zeta_k)/2$$


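for some $\zeta_k$ in the interval containing $x_k$, $x_{k-1}$, and $z$. To see this, suppose
that $z > x_k > x_{k-1}$ (although true in general). Then Taylor's theorem gives

$$f(z) = f(x_k) + f'(x_k)(z - x_k) + f''(\hat\zeta_k)\,\frac{(z - x_k)^2}{2}$$

and

$$f(x_{k-1}) = f(x_k) + f'(x_k)(x_{k-1} - x_k) + f''(\tilde\zeta_k)\,\frac{(x_{k-1} - x_k)^2}{2}.$$

Substituting into $C_k$,

$$C_k = \left(\frac{z - x_k}{z - x_{k-1}}\right)\frac{f''(\hat\zeta_k)}{2} + \left(\frac{x_k - x_{k-1}}{z - x_{k-1}}\right)\frac{f''(\tilde\zeta_k)}{2} = \frac{f''(\zeta_k)}{2}$$

by the Intermediate Value Theorem, since

$$\frac{z - x_k}{z - x_{k-1}} + \frac{x_k - x_{k-1}}{z - x_{k-1}} = 1.$$

Also, $\zeta_k$ is in the interval containing $z$, $x_{k-1}$, and $x_k$, since $\hat\zeta_k$ is between $z$
and $x_k$ and $\tilde\zeta_k$ is between $x_{k-1}$ and $x_k$. Thus,

$$|e_{k+1}| = |e_{k-1}|\,|e_k|\,\frac{|f''(\zeta_k)|}{2\,|f'(\xi_k)|} \le |e_{k-1}|\,|e_k|\,M \qquad (2.17)$$

for some $\zeta_k$ and $\xi_k$ in the interval containing $z$, $x_k$, and $x_{k-1}$. When $k = 1$,
we have

$$|e_2| \le |e_1|\,|e_0|\,M.$$

Thus,
$$M|e_2| \le M|e_0|\,M|e_1| \le \delta^2 \quad\text{if } x_0, x_1 \in K_\epsilon(z).$$

Also, $|e_2| \le \epsilon^2 M = \epsilon\delta < \epsilon$, so $x_2 \in K_\epsilon(z)$ if $x_0, x_1 \in K_\epsilon(z)$. Thus,
$M|e_2| \le M|e_0|\,M|e_1|$ and $|e_2| < \epsilon$.
Consider now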


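$$|e_3| \le M|e_2|\,|e_1| < \epsilon$$
and
$$M|e_3| \le M|e_2|\,M|e_1| \le \delta^3.$$
Thus, in the $k$-th step, if $|e_k| < \epsilon$ and $M|e_k| \le \delta^{q_k}$, then
$$|e_{k+1}| < \epsilon$$
and
$$M|e_{k+1}| \le M|e_k|\,M|e_{k-1}| \le \delta^{q_k + q_{k-1}} = \delta^{q_{k+1}}.$$
Hence,

$$q_{k+1} = q_k + q_{k-1}, \qquad q_0 = q_1 = 1, \qquad (2.18)$$

that is, $\{q_k\}_{k=0}^{\infty}$ is the Fibonacci sequence. To solve the difference equation
(2.18), we can assume the form $q_k = a^k$. We then obtain

$$a^{k+1} = a^k + a^{k-1},$$

so $a^2 = a + 1$. Solving this quadratic equation gives

$$a = \frac{1 \pm \sqrt{5}}{2},$$
so
$$q_k = c_0\left(\frac{1+\sqrt{5}}{2}\right)^k + c_1\left(\frac{1-\sqrt{5}}{2}\right)^k \quad\text{with } q_0 = q_1 = 1.$$
Hence,
$$q_k = \frac{1+\sqrt{5}}{2\sqrt{5}}\left(\frac{1+\sqrt{5}}{2}\right)^k - \frac{1-\sqrt{5}}{2\sqrt{5}}\left(\frac{1-\sqrt{5}}{2}\right)^k.$$
Therefore,
$$q_k \ge \frac{1}{\sqrt{5}}\left(\frac{1+\sqrt{5}}{2}\right)^k \qquad (2.19)$$
for $k = 1, 2, 3, \ldots$. Thus,
$$M|e_k| \le \delta^{\frac{1}{\sqrt{5}}\left(\frac{1+\sqrt{5}}{2}\right)^k}$$

for $k = 1, 2, 3, \ldots$, so $x_k \to z$ as $k \to \infty$.

2.6.2 Müller's Method


We now consider a variant of the secant method called Müller's method.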


REMARK 2.15 Müller's method can be used to obtain real and complex
roots of a function and thus is often used to find zeros of polynomials.


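To define Müller's method, it is useful to use divided differences. (There is
more on divided differences in Section 4.3.)


DEFINITION 2.5 We define

$$f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0}$$

to be the second-order divided difference, where $f[x_i, x_j] = \dfrac{f(x_j) - f(x_i)}{x_j - x_i}$ is the first-order divided difference.


REMARK 2.16 It can be shown that $f[x_0, x_1] = f'(\xi)$ for some $\xi$ between
$x_0$ and $x_1$ and $f[x_0, x_1, x_2] = \frac{1}{2} f''(\zeta)$ for some $\zeta$ between $x_0$, $x_1$, and $x_2$.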


We first present a geometric interpretation of Müller's method. Suppose
that $x_k$, $x_{k-1}$, and $x_{k-2}$ are approximations to $z$. To find $x_{k+1}$, we follow the
quadratic polynomial through the points $(x_k, f(x_k))$, $(x_{k-1}, f(x_{k-1}))$, and
$(x_{k-2}, f(x_{k-2}))$ to the $x$-axis. Let $p(x)$ be this quadratic polynomial. Then,

$$p(x) = f(x_k) + (x - x_k) f[x_k, x_{k-1}] + (x - x_k)(x - x_{k-1}) f[x_k, x_{k-1}, x_{k-2}]. \qquad (2.20)$$

(See §4.3.4.) Thus, setting $x = x_{k+1}$,

$$0 = f(x_k) + (x_{k+1} - x_k) f[x_k, x_{k-1}] + (x_{k+1} - x_k)(x_{k+1} - x_{k-1}) f[x_k, x_{k-1}, x_{k-2}]. \qquad (2.21)$$

Solving for $x_{k+1}$ gives us Müller's method. (See Figure 2.8.) That is, we
find the root of the quadratic through $x_{k-2}$, $x_{k-1}$, and $x_k$ to obtain $x_{k+1}$.

Rearranging to reduce rounding errors due to subtracting of nearly equal
quantities, we obtain

$$x_{k+1} = x_k - \frac{2\kappa_k}{1 + \sqrt{1 - 4\kappa_k\lambda_k}} \qquad (2.22)$$

for $k = 2, 3, \ldots$, where

$$\kappa_k = \frac{f(x_k)}{\omega_k}, \qquad \lambda_k = \frac{f[x_k, x_{k-1}, x_{k-2}]}{\omega_k},$$

and

$$\omega_k = f[x_k, x_{k-1}] + (x_k - x_{k-1}) f[x_k, x_{k-1}, x_{k-2}].$$

FIGURE 2.8: Geometric interpretation of Müller's method.


REMARK 2.17 The sign has been chosen in the quadratic formula for
(2.22) so that $x_{k+1}$ is the nearest solution to $x_k$ of the quadratic equation
(2.21).


REMARK 2.18 It can be shown that the order of convergence of Müller's
method is about 1.83, which is a little faster than the secant method but slower
than Newton's method.


REMARK 2.19 If $(1 - 4\kappa_k\lambda_k) < 0$, then $x_{k+1}$ is complex and the method,
if convergent, converges to complex roots. However, even if all the roots are
real, complex arithmetic must be used, because $1 - 4\kappa_k\lambda_k$ may be less than
zero at some iteration $k$.
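A minimal Python sketch of Müller's method follows, written in terms of the divided differences of Definition 2.5 and carried out in complex arithmetic as Remark 2.19 requires. The names `muller`, `kappa`, `lam`, and `w`, the stopping test, and the tolerance are illustrative assumptions; here `w` is $f[x_k,x_{k-1}] + (x_k - x_{k-1})f[x_k,x_{k-1},x_{k-2}]$, `kappa` is $f(x_k)/w$, and `lam` is $f[x_k,x_{k-1},x_{k-2}]/w$.

```python
import cmath


def muller(f, x0, x1, x2, tol=1e-12, max_iter=100):
    """Mueller's method in complex arithmetic (hypothetical helper)."""
    xkm2, xkm1, xk = complex(x0), complex(x1), complex(x2)
    for _ in range(max_iter):
        fk = f(xk)
        if fk == 0:
            return xk
        d1 = (fk - f(xkm1)) / (xk - xkm1)            # f[x_k, x_{k-1}]
        d0 = (f(xkm1) - f(xkm2)) / (xkm1 - xkm2)     # f[x_{k-1}, x_{k-2}]
        d2 = (d1 - d0) / (xk - xkm2)                 # f[x_k, x_{k-1}, x_{k-2}]
        w = d1 + (xk - xkm1) * d2
        kappa = fk / w
        lam = d2 / w
        # Root of the interpolating quadratic nearest to x_k.
        x_new = xk - 2.0 * kappa / (1.0 + cmath.sqrt(1.0 - 4.0 * kappa * lam))
        if abs(x_new - xk) <= tol:                   # assumed stopping criterion
            return x_new
        xkm2, xkm1, xk = xkm1, xk, x_new
    raise RuntimeError("Mueller's method did not converge")


# Real starting points, complex root: f(x) = x^2 + 1 has zeros +i and -i.
z_complex = muller(lambda x: x * x + 1.0, 0.5, 1.0, 1.5)
# Real root: f(x) = x^2 - 4.
z_real = muller(lambda x: x * x - 4.0, 0.0, 1.0, 3.0)
```

Since the interpolating polynomial is exact when $f$ itself is a quadratic, both examples essentially converge after one step; for general $f$ several steps are needed, and the sketch does not guard against $w = 0$.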


2.7 Aitken Acceleration and Steffensen’s Method 


Linearly convergent sequences are obtained in many numerical methods,
as we have seen in examples of fixed point iteration. In this section, we
study a method useful for accelerating the convergence of linearly convergent
sequences.

2.7.1 Aitken Acceleration


In this derivation, assume that $\{x_k\}$ is a linearly convergent sequence with
limit $\alpha$. That is,

$$\lim_{k\to\infty} \frac{x_{k+1} - \alpha}{x_k - \alpha} = \lambda, \qquad |\lambda| < 1.$$

For example, for the fixed-point method with continuous $g'$, $\lambda = g'(\alpha)$, since

$$\alpha - x_{k+1} = g(\alpha) - g(x_k) = g'(\xi_k)(\alpha - x_k),$$
for some $\xi_k$ between $\alpha$ and $x_k$. Assume now that $k$ is sufficiently large to
ensure that $(x_{k+1} - \alpha) \approx \lambda(x_k - \alpha)$. Then

$$(x_{k+1} - \alpha) \approx \lambda(x_k - \alpha) \qquad (2.23)$$
and, replacing $k$ by $k+1$,
$$(x_{k+2} - \alpha) \approx \lambda(x_{k+1} - \alpha). \qquad (2.24)$$
Solving (2.23) and (2.24) for $\alpha$ yields

$$\alpha \approx x_k - \frac{(x_{k+1} - x_k)^2}{(x_{k+2} - x_{k+1}) - (x_{k+1} - x_k)} \qquad (2.25)$$
$$\phantom{\alpha} = x_{k+2} - \frac{(x_{k+2} - x_{k+1})^2}{(x_{k+2} - x_{k+1}) - (x_{k+1} - x_k)}.$$
It is therefore reasonable to define another sequence $\{\hat{x}_k\}$ by

$$\hat{x}_k = x_k - \frac{(x_{k+1} - x_k)^2}{(x_{k+2} - x_{k+1}) - (x_{k+1} - x_k)} \quad\text{for } k = 0, 1, 2, \ldots. \qquad (2.26)$$

We would expect $\{\hat{x}_k\}$ to converge to $\alpha$ faster than $\{x_k\}$. This procedure is
called Aitken acceleration or extrapolation.
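Formula (2.26) translates directly into code. The sketch below is illustrative only: the helper names `aitken` and `fixed_point_sequence` are assumptions, and the test iteration $g(x) = (x^2 + 6)/5$ with $x_0 = 0$ is the one used in Example 2.15.

```python
def aitken(x):
    """Aitken-accelerated values computed from a list x via formula (2.26)."""
    xhat = []
    for k in range(len(x) - 2):
        d1 = x[k + 1] - x[k]          # x_{k+1} - x_k
        d2 = x[k + 2] - x[k + 1]      # x_{k+2} - x_{k+1}
        xhat.append(x[k] - d1 * d1 / (d2 - d1))
    return xhat


def fixed_point_sequence(g, x0, n):
    """Return [x_0, x_1, ..., x_n] with x_{k+1} = g(x_k)."""
    xs = [x0]
    for _ in range(n):
        xs.append(g(xs[-1]))
    return xs


xs = fixed_point_sequence(lambda x: (x * x + 6.0) / 5.0, 0.0, 7)
xhat = aitken(xs)   # six accelerated values from eight iterates
```

Each accelerated value needs three consecutive iterates, so a sequence of length $n$ yields $n - 2$ values of $\hat{x}_k$. The sketch does not guard against a zero denominator, which Theorem 2.9 rules out only for $k$ sufficiently large.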


Example 2.15
Consider the fixed-point iteration

$$x_{k+1} = g(x_k) = \frac{x_k^2 + 6}{5}, \qquad k = 0, 1, 2, \ldots.$$

In this problem, $\alpha = 2$.

 k   $x_k$      $\hat{x}_k$
 0   0          1.57895
 1   1.2        1.82284
 2   1.488      1.90213
 3   1.64283    1.94259
 4   1.73978    1.96516
 5   1.80536    1.97853
 6   1.85187
 7   1.88588

In this example, Aitken acceleration clearly improves the convergence.


Example 2.16
This example illustrates that Aitken acceleration is not limited to fixed-point
iteration. Consider

$$\int_0^1 f(x)\,dx \approx \sum_{i=1}^{2^k} h_k f(x_i) \quad\text{where } h_k = \frac{1}{2^k}, \quad x_i = i h_k - \frac{h_k}{2}$$

for $i = 1, 2, \ldots, 2^k$, for integers $k \ge 0$. This method is the composite midpoint
rule for approximating the integral, and can be shown to be linearly
convergent. Let $f(x) = x^{-1/2}$ and

$$S_k = \sum_{i=1}^{2^k} h_k f(x_i).$$

The values obtained for this example are given in the following table:

 k   $S_k$      $\hat{S}_k$
 1   1.57735
 2   1.69884    2.00325
 3   1.78646
 4   1.84886

Notice that $\alpha = \int_0^1 x^{-1/2}\,dx = 2$, so Aitken acceleration is effective in this
example.


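We have the following convergence result for Aitken acceleration.


THEOREM 2.9
Let $\{x_k\}$ be a linearly convergent sequence with limit $\alpha$, such that the quantities
$d_k = x_k - \alpha$ satisfy $d_{k+1} = (\lambda + \epsilon_k) d_k$, where $\lambda$ is constant, $|\lambda| < 1$
and $\epsilon_k \to 0$ as $k \to \infty$. Then the sequence $\{\hat{x}_k\}$ derived from $\{x_k\}$ by (2.26)
converges to $\alpha$ faster than $\{x_k\}$. That is,

$$\frac{\hat{x}_k - \alpha}{x_k - \alpha} \to 0 \quad\text{as } k \to \infty. \qquad (2.27)$$


PROOF We have

$$d_{k+2} = (\lambda + \epsilon_{k+1}) d_{k+1} = (\lambda + \epsilon_{k+1})(\lambda + \epsilon_k) d_k.$$

Hence,

$$(x_{k+2} - x_{k+1}) - (x_{k+1} - x_k) = (d_{k+2} - d_{k+1}) - (d_{k+1} - d_k) = \bigl[(\lambda - 1)^2 + \epsilon_k'\bigr] d_k,$$

where

$$\epsilon_k' = \lambda(\epsilon_k + \epsilon_{k+1}) - 2\epsilon_k + \epsilon_k\epsilon_{k+1}.$$

Since $\epsilon_k \to 0$, it is clear that $\epsilon_k' \to 0$ as $k \to \infty$. Thus, for $k$ sufficiently large,
say $k \ge k_0$, it follows that

$$(\lambda - 1)^2 + \epsilon_k' \ne 0,$$

so

$$(x_{k+2} - x_{k+1}) - (x_{k+1} - x_k) \ne 0,$$

so $\{\hat{x}_k\}$ is defined for $k \ge k_0$. We also have
$$x_{k+1} - x_k = d_{k+1} - d_k = (\lambda + \epsilon_k - 1) d_k.$$
Subtracting $\alpha$ from both sides of (2.26),

$$\hat{x}_k - \alpha = d_k - \frac{(x_{k+1} - x_k)^2}{(x_{k+2} - x_{k+1}) - (x_{k+1} - x_k)} = d_k - \frac{(\lambda - 1 + \epsilon_k)^2\, d_k}{(\lambda - 1)^2 + \epsilon_k'} = \frac{\epsilon_k' - 2\epsilon_k(\lambda - 1) - \epsilon_k^2}{(\lambda - 1)^2 + \epsilon_k'}\, d_k,$$

recalling that $d_k = x_k - \alpha$. Hence,

$$\frac{\hat{x}_k - \alpha}{x_k - \alpha} = \frac{\epsilon_k' - 2\epsilon_k(\lambda - 1) - \epsilon_k^2}{(\lambda - 1)^2 + \epsilon_k'} \to 0 \quad\text{as } k \to \infty.$$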


A relevant question is: Since not all sequences are linearly convergent, how
do we determine when to apply Aitken acceleration? A measure for determining
when to use Aitken acceleration is discussed in [41]. First, it can be
shown that $\dfrac{x_1 - \alpha}{x_0 - \alpha} = \dfrac{x_2 - \alpha}{x_1 - \alpha} = \dfrac{x_3 - \alpha}{x_2 - \alpha} = \lambda$, i.e., the errors are linearly convergent, if and
only if the points $(x_0, x_1)$, $(x_1, x_2)$, $(x_2, x_3)$ lie on a straight line. Therefore,
how closely a line fits these points indicates how well the errors are linearly
converging. The serial correlation coefficient, a measure of the linearity of the
three points, is given by

$$r = \frac{s_1 - s_2 s_3/3}{\sqrt{(s_4 - s_2^2/3)(s_5 - s_3^2/3)}},$$

where

$$s_1 = \sum_{i=1}^{3} x_i x_{i-1}, \quad s_2 = \sum_{i=1}^{3} x_{i-1}, \quad s_3 = \sum_{i=1}^{3} x_i, \quad s_4 = \sum_{i=1}^{3} x_{i-1}^2, \quad\text{and}\quad s_5 = \sum_{i=1}^{3} x_i^2.$$

If $r$ is close to $\pm 1$, e.g., $|r| > 0.999$, then Aitken acceleration is most effective.
For Example 2.16, $r \approx 0.99998$.
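As a rough check of this criterion, the serial correlation coefficient can be computed for the composite midpoint sums of Example 2.16. The sketch below assumes that $r$ is the usual sample correlation of the three points $(x_{i-1}, x_i)$, $i = 1, 2, 3$; the helper names `serial_correlation` and `midpoint_sum` are hypothetical.

```python
import math


def serial_correlation(x0, x1, x2, x3):
    """Correlation coefficient of the points (x0, x1), (x1, x2), (x2, x3)."""
    u = (x0, x1, x2)                      # abscissas x_{i-1}
    v = (x1, x2, x3)                      # ordinates x_i
    s1 = sum(a * b for a, b in zip(u, v))
    s2, s3 = sum(u), sum(v)
    s4 = sum(a * a for a in u)
    s5 = sum(b * b for b in v)
    return (s1 - s2 * s3 / 3.0) / math.sqrt((s4 - s2 * s2 / 3.0) * (s5 - s3 * s3 / 3.0))


def midpoint_sum(f, k):
    """Composite midpoint rule on [0, 1] with 2**k subintervals."""
    n = 2 ** k
    h = 1.0 / n
    return sum(h * f(i * h - h / 2.0) for i in range(1, n + 1))


# S_1, ..., S_4 for f(x) = x^(-1/2), as in Example 2.16.
S = [midpoint_sum(lambda x: x ** -0.5, k) for k in (1, 2, 3, 4)]
r = serial_correlation(*S)
```

A value of `r` above the threshold 0.999 suggests that the four sums behave like a linearly convergent sequence, so Aitken acceleration is worth applying to them.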


2.7.2 Steffensen’s Method 


Aitken acceleration applied specifically to fixed-point iteration gives rise
to a procedure known as Steffensen's method, which can be shown to be
quadratically convergent under certain conditions. Consider $f(x) = 0$ and
$x_{k+1} = f(x_k) + x_k$. (That is, $g(x) = x + f(x)$.) Then, Aitken acceleration
gives

$$\hat{x}_k = x_k - \frac{(x_{k+1} - x_k)^2}{(x_{k+2} - x_{k+1}) - (x_{k+1} - x_k)} = x_k - \frac{(f(x_k))^2}{f(x_{k+1}) - f(x_k)} = x_k - \frac{(f(x_k))^2}{f(f(x_k) + x_k) - f(x_k)}.$$

Letting $x_k = \hat{x}_{k-1}$ on the right-hand side gives

$$\hat{x}_k = \hat{x}_{k-1} - \frac{(f(\hat{x}_{k-1}))^2}{f(f(\hat{x}_{k-1}) + \hat{x}_{k-1}) - f(\hat{x}_{k-1})}, \qquad (2.28)$$

which is Steffensen's method.


REMARK 2.20 Steffensen's method is a fixed-point method
$$x_k = S(x_{k-1})$$
with
$$S(x) = x - \frac{(f(x))^2}{f(f(x) + x) - f(x)}.$$


REMARK 2.21 Steffensen's method applied to a fixed-point iteration
$x_{k+1} = g(x_k)$ has the form $x_{k+1} = H(x_k)$, with

$$H(x_k) = x_k - \frac{(g(x_k) - x_k)^2}{g(g(x_k)) - 2g(x_k) + x_k}.$$

REMARK 2.22 By Remark 2.20, if $f(\alpha) = 0$, then $S(\alpha) = \alpha$ and $S'(\alpha) =
0$. Thus, by Theorem 2.4, page 46, if Steffensen's method is convergent, then
it can converge quadratically. (You will show this in Exercise 16.)


Example 2.17
Comparison of Newton's method and Steffensen's method for $f(x) = x^2 - 4$.
(Note that $\alpha = 2$.)

Newton's method:

$$x_{k+1} = x_k - \frac{x_k^2 - 4}{2x_k}.$$

Steffensen's method:

$$x_{k+1} = x_k - \frac{(x_k^2 - 4)^2}{(x_k^2 - 4 + x_k)^2 - 4 - x_k^2 + 4} = x_k - \frac{(x_k - 2)(x_k + 2)}{x_k^2 + 2x_k - 4}.$$

 k   Newton's     Steffensen's
 0   3            3
 1   2.16667      2.545
 2   2.00641      2.218
 3   2.0000102    2.0463
 4   2.0000000    2.00253
 5   2.0000000    2.000008

(We see quadratic convergence for both methods, although Newton's method
is faster.)
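The Steffensen column of Example 2.17 can be reproduced with a short sketch of formula (2.28). The names `steffensen_step` and `steffensen`, the stopping test, and the tolerance are illustrative assumptions rather than part of the text.

```python
def steffensen_step(f, x):
    """One step x -> S(x) = x - f(x)^2 / (f(f(x) + x) - f(x))."""
    fx = f(x)
    denom = f(fx + x) - fx
    if denom == 0.0:
        raise ZeroDivisionError("Steffensen step undefined")
    return x - fx * fx / denom


def steffensen(f, x0, tol=1e-10, max_iter=50):
    """Iterate x_k = S(x_{k-1}) until successive iterates agree to tol."""
    x = x0
    for _ in range(max_iter):
        if f(x) == 0.0:
            return x                      # already an exact zero
        x_new = steffensen_step(f, x)
        if abs(x_new - x) <= tol:         # assumed stopping criterion
            return x_new
        x = x_new
    raise RuntimeError("Steffensen's method did not converge")


f = lambda x: x * x - 4.0
first = steffensen_step(f, 3.0)           # first Steffensen iterate from x_0 = 3
root = steffensen(f, 3.0)
```

No derivative of $f$ is needed, at the cost of two function evaluations per step.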


2.8 Roots of Polynomials 
In this section, we consider the special case of finding the roots of
$$f(x) = p(x) = a_0 + a_1 x + \cdots + a_n x^n,$$
which is an important but difficult problem. We first review some important
results on roots of polynomials.


THEOREM 2.10
(Fundamental theorem of algebra) If $p(x)$ is a polynomial of degree $n \ge 1$,
then $p(x) = 0$ has at least one root (possibly complex).

COROLLARY 2.1
If
$$p(x) = a_0 + a_1 x + \cdots + a_n x^n,$$
then there are $n$ roots $z_1, z_2, \ldots, z_n \in \mathbb{C}$ (which may be repeated), and

$$p(x) = a_n (x - z_1)(x - z_2) \cdots (x - z_n).$$


THEOREM 2.11
(Descartes' rule of signs) Let $p$ be a polynomial with real coefficients, let $\nu_1$
be the number of changes in the sign of the coefficients of $p(x)$, and let $\nu_2$ be
the number of changes in the sign of the coefficients of $p(-x)$. Let $k_1$ and
$k_2$ be the number of positive real roots and negative real roots of $p(x)$. Then
$k_1 \le \nu_1$ and $\nu_1 - k_1$ is even, $k_2 \le \nu_2$ and $\nu_2 - k_2$ is even.


Example 2.18
Consider $p(x) = x^r - x - a$, with $r > 1$ and $a > 0$. By Corollary 2.1, there are
$r$ real and complex roots of $p(x)$. Consider $p(-x) = (-1)^r x^r + x - a$. Clearly,

$$\nu_1 = 1 \quad\text{and}\quad \nu_2 = \begin{cases} 1, & r \text{ even}, \\ 2, & r \text{ odd}. \end{cases}$$

Thus, $k_1 = 1$, since $\nu_1 - k_1$ must be even; $k_2 = 1$ if $\nu_2 = 1$ and $k_2 = 2$ or $0$ if
$\nu_2 = 2$. Therefore, if $r$ is even, there are one positive root and one negative
root. If $r$ is odd, then there are one positive root and two or zero negative
roots.


We now have the following useful results on upper and lower bounds on the
roots of a polynomial [8].


Result 1: $\displaystyle \max_{k} |z_k| \le 1 + \max_{0 \le i \le n-1} \left|\frac{a_i}{a_n}\right| = R$. (All roots lie in a disc in the
complex plane of radius $R$; see Figure 2.9.)


Result 2: $\rho_1 \le |z_k| \le \rho_2$ for all $k$, where $\rho_1$ is the unique positive root of
$$|a_n| x^n + |a_{n-1}| x^{n-1} + \cdots + |a_1| x - |a_0| = 0$$
and $\rho_2$ is the unique positive root of
$$|a_n| x^n - |a_{n-1}| x^{n-1} - \cdots - |a_1| x - |a_0| = 0.$$

(All roots lie in an annulus in the complex plane; see Figure 2.10.)


REMARK 2.23 An efficient way to computationally evaluate a polynomial
is using nested multiplication.

$$p(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + \cdots + x(a_{n-1} + a_n x) \cdots)))$$

Using nested multiplication, there are $n$ multiplications and $n$ additions in
comparison with $2n - 1$ multiplications and $n$ additions if $p(x)$ is evaluated
directly.¹

¹This assumes that, in "direct evaluation," $x^k$ is evaluated as $x \cdot x^{k-1}$ for $k = 2, \ldots, n$.


FIGURE 2.9: Disc in which roots must lie.

FIGURE 2.10: Annulus in which roots must lie.


The following elementary but useful result relates nested multiplication to
division of a polynomial by a linear factor.


THEOREM 2.12
(Horner's method or synthetic division) Let

$$p(x) = a_0 + a_1 x + \cdots + a_n x^n.$$

Let $b_n = a_n$ and $b_k = a_k + b_{k+1} z$ for $k = n-1, n-2, \ldots, 0$. Then $b_0 = p(z)$.
Moreover, if

$$q(x) = b_1 + b_2 x + \cdots + b_n x^{n-1},$$

then
$$p(x) = (x - z) q(x) + b_0.$$


PROOF We have

$$\begin{aligned}
b_0 + (x - z) q(x)
&= (x - z)(b_1 + b_2 x + \cdots + b_n x^{n-1}) + b_0 \\
&= b_1 x + b_2 x^2 + \cdots + b_n x^n - (b_1 z + b_2 z x + \cdots + b_n z x^{n-1}) + b_0 \\
&= b_n x^n + (b_{n-1} - b_n z) x^{n-1} + (b_{n-2} - b_{n-1} z) x^{n-2} + \cdots + (b_0 - b_1 z) \\
&= a_n x^n + a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \cdots + a_1 x + a_0.
\end{aligned}$$

Thus, $p(x) = b_0 + (x - z) q(x)$ and $b_0 = p(z)$.
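The recurrence of Theorem 2.12 can be sketched as follows. The function name `horner` and the convention that `a[i]` holds the coefficient of $x^i$ are assumptions made for illustration.

```python
def horner(a, z):
    """Synthetic division of p(x) = a[0] + a[1] x + ... + a[n] x^n by (x - z).

    Returns (b_0, [b_1, ..., b_n]), so that p(z) = b_0 and
    p(x) = (x - z) * (b_1 + b_2 x + ... + b_n x^(n-1)) + b_0.
    """
    n = len(a) - 1
    b = [0] * (n + 1)
    b[n] = a[n]
    for k in range(n - 1, -1, -1):
        b[k] = a[k] + b[k + 1] * z    # one multiplication and one addition
    return b[0], b[1:]


# p(x) = x^2 + 3x + 2 = (x + 1)(x + 2)
value_at_2, _ = horner([2, 3, 1], 2)      # p(2)
remainder, q = horner([2, 3, 1], -1)      # divide by (x + 1)
```

When $z$ is a root, the remainder $b_0$ vanishes and the returned list holds the coefficients of the deflated polynomial $q(x)$.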
Now consider a general procedure for finding the roots of a polynomial $p(x)$.

ALGORITHM 2.3
(Deflation method for finding all roots of a polynomial)
INPUT: the coefficients $\{a_i\}_{i=0}^{n}$ of the polynomial $p(x) = \sum_{i=0}^{n} a_i x^i$.
OUTPUT: approximations to the roots of the polynomial.


1. Find root $z_1$ using some procedure such as Müller's method or Newton's
method.²


2. Deflate $p(x) = (x - z_1) q(x)$ using Horner's method.


3. Find root $z_2$ of $q(x)$ using the numerical method in Step 1. Note that
$q(x)$ is of degree $n - 1$.


4. Continue deflating and calculating roots.


5. After finding each root, iterate on that root, applying the method in
Step 1, using the original polynomial $p(x)$.


END ALGORITHM 2.3.


REMARK 2.24 The roots may be sensitive to small changes in the
coefficients. Thus, the errors introduced by rounding and approximation of
$z_1$, for example, can lead to large errors later. To help correct this problem,
step (5) is added.
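One possible realization of Algorithm 2.3 is sketched below, using Newton's method in complex arithmetic for Step 1 and Horner's method for deflation. Everything beyond the five steps themselves is an assumption: the names `horner`, `newton_poly`, and `all_roots`, the complex starting guess `0.4 + 0.9j`, the tolerances, and the coefficient convention (`a[i]` multiplies $x^i$). The sketch also uses the fact that $p'(z) = q(z)$ when $p(x) = (x - z)q(x) + p(z)$.

```python
def horner(a, z):
    """Return (p(z), coefficients of q) with p(x) = (x - z) q(x) + p(z)."""
    b = list(a)
    for k in range(len(a) - 2, -1, -1):
        b[k] = a[k] + b[k + 1] * z
    return b[0], b[1:]


def newton_poly(a, z0, tol=1e-13, max_iter=200):
    """Newton's method for a polynomial, in complex arithmetic."""
    z = complex(z0)
    for _ in range(max_iter):
        pz, q = horner(a, z)
        if pz == 0:
            return z
        dpz, _ = horner(q, z)             # p'(z) = q(z)
        if dpz == 0:
            raise ZeroDivisionError("p'(z) vanished during Newton iteration")
        z_new = z - pz / dpz
        if abs(z_new - z) <= tol * max(1.0, abs(z_new)):
            return z_new
        z = z_new
    raise RuntimeError("Newton's method did not converge")


def all_roots(a, z0=0.4 + 0.9j):
    """Deflation method: find all roots of p with coefficients a."""
    roots = []
    work = list(a)
    while len(work) > 2:                  # degree of the working polynomial >= 2
        z = newton_poly(work, z0)         # Steps 1 and 3: root of current polynomial
        z = newton_poly(a, z)             # Step 5: polish on the original p
        roots.append(z)
        _, work = horner(work, z)         # Steps 2 and 4: deflate
    # The remaining polynomial is linear; polish its root on the original p.
    roots.append(newton_poly(a, -work[0] / work[1]))
    return roots


roots_real = all_roots([2.0, 3.0, 1.0])   # x^2 + 3x + 2
roots_cplx = all_roots([1.0, 0.0, 1.0])   # x^2 + 1
```

A fixed starting guess is the weakest part of this sketch; choosing the guess inside the disc of Result 1 is one alternative.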


²Here, the root may be complex, so we will use complex arithmetic, and modify our
geometric view of Newton's method or Müller's method.

To understand the sensitivity of the roots to perturbations in the coefficients,
let's perform a stability analysis. Consider

$$p(x) = a_0 + a_1 x + \cdots + a_n x^n$$
and
$$q(x) = b_0 + b_1 x + \cdots + b_n x^n,$$
where $q(x)$ is a perturbation to $p(x)$. Let

$$p(x, \epsilon) = p(x) + \epsilon q(x)$$

be the perturbed polynomial. Assume that $z_j$ is a simple root of $p(x)$. (Multiple
roots can be considered in a similar manner. See, for example, [8].)
Then

$$z_j(\epsilon) = z_j + \sum_{\ell=1}^{\infty} \gamma_\ell\, \epsilon^\ell$$

is a Taylor series for $z_j(\epsilon)$ as a function of $\epsilon$, and $\gamma_1 = z_j'(0)$. Thus, $z_j(\epsilon) - z_j \approx
z_j'(0)\epsilon$ is an approximation to the error in the roots due to perturbation of
$p(x)$ by $\epsilon q(x)$. But $p(z_j(\epsilon)) + \epsilon q(z_j(\epsilon)) = 0$ and taking the derivative with
respect to $\epsilon$ gives

$$p'(z_j(\epsilon))\, z_j'(\epsilon) + q(z_j(\epsilon)) + \epsilon\, q'(z_j(\epsilon))\, z_j'(\epsilon) = 0.$$

Therefore,
$$z_j'(\epsilon) = \frac{-q(z_j(\epsilon))}{p'(z_j(\epsilon)) + \epsilon\, q'(z_j(\epsilon))}.$$
Hence,
$$z_j(\epsilon) \approx z_j - \frac{q(z_j)}{p'(z_j)}\,\epsilon,$$
where $z_j(\epsilon)$ is the root of $p(x) + \epsilon q(x)$ and $z_j$ is the root of $p(x)$.


Example 2.19
(Conditioning of a Wilkinson polynomial of degree 7) Let

$$p(x) = \prod_{i=1}^{7} (x - i) = x^7 - 28x^6 + 322x^5 + \cdots - 5040,$$

let $q(x) = x^6$, and let $\epsilon = -0.002$. (That is, we are introducing a small change
in the coefficient of $x^6$ of $p(x)$.) Then

$$z_j(\epsilon) \approx z_j - \frac{q(z_j)\,\epsilon}{p'(z_j)} = j - \frac{j^6\,\epsilon}{\prod_{i \ne j} (j - i)} = j + \frac{j^6\,(0.002)}{\prod_{i \ne j} (j - i)}.$$

Hence,

$$z_7(\epsilon) \approx 7.33 \text{ for } z_7 = 7, \qquad z_6(\epsilon) \approx 5.22 \text{ for } z_6 = 6, \qquad z_5(\epsilon) \approx 5.65 \text{ for } z_5 = 5.$$

Thus, the roots of $p$ are unstable to small changes in the coefficients. The
polynomial $p$ is said to be ill-conditioned. We may need to employ double-
or multiple-precision arithmetic to obtain better results, or even use interval
arithmetic to check the results.
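The three estimates in Example 2.19 are easy to recompute from the first-order formula $z_j(\epsilon) \approx z_j - q(z_j)\epsilon / p'(z_j)$, using $p'(j) = \prod_{i \ne j}(j - i)$ for the Wilkinson polynomial. The function name and its parameters below are hypothetical, chosen only for this illustration.

```python
def first_order_root_estimate(j, eps, n=7, q_power=6):
    """First-order estimate of the perturbed root near z_j = j.

    Here p(x) = prod_{i=1}^{n} (x - i) and the perturbation is eps * x**q_power.
    """
    dp = 1.0
    for i in range(1, n + 1):
        if i != j:
            dp *= (j - i)                 # p'(j) = product over i != j of (j - i)
    return j - (j ** q_power) * eps / dp


estimates = {j: first_order_root_estimate(j, -0.002) for j in (5, 6, 7)}
```

A change of 0.002 in a single coefficient moves the estimated roots by several tenths, which is the ill-conditioning the example describes; the estimates for the roots at 5 and 6 nearly swap order.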


REMARK 2.25 Another method of finding roots of polynomials is to
numerically find the eigenvalues of the companion matrix

$$C = \begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
-a_0 & -a_1 & -a_2 & \cdots & -a_{n-1}
\end{pmatrix}.$$

Roots of the polynomial $p(x) = x^n + a_{n-1} x^{n-1} + \cdots + a_0$ are the same as
the eigenvalues of $C$, since $\det(\lambda I - C) = p(\lambda)$. (See Chapter 5, starting on
page 291, for a review of eigenvalues and eigenvectors, and a description of
how to compute them numerically.)


REMARK 2.26 An alternate method for finding rigorous bounds on
all roots of a polynomial makes use of interval Newton methods combined
with a systematic subdivision of a region in which all roots must lie. These
techniques, special cases of branch and bound methods, are considered in
Section 9.6.3 on page 523.


2.9 Additional Notes and Summary of Chapter 2 


REMARK 2.27 Suppose that $f(x) = (x - z)^r q(x)$, where $r \ge 1$ and
$\lim_{x \to z} q(x) \ne 0$. Function $f$ is said to have the zero $z$ of multiplicity $r$. Notice
that if $r > 1$, then $f(z) = f'(z) = 0$. Hence, there may be convergence
problems as $x_k \to z$ for Newton's method (and also the secant method), since

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}.$$

(Recall Theorem 2.5 on page 50, where we analyze the convergence of Newton's
method. There, we required $|f'(x)| \ge m > 0$ for $x$ near $z$. Also, see
Exercise 22 on page 80.) Can we reduce this problem? Let $g(x) = f(x)/f'(x)$.
Then
$$g(x) = \frac{(x - z)^r q(x)}{r (x - z)^{r-1} q(x) + (x - z)^r q'(x)} = \frac{(x - z)\, q(x)}{r\, q(x) + (x - z)\, q'(x)}.$$

Hence, $g(x)$ has a zero of multiplicity 1 at $z$ and $g'(z) \ne 0$. If we apply Newton's
method to $g(x)$ rather than to $f(x)$, our method is called the modified
Newton's method, and works well for zeros of multiplicity greater than one.
Thus,
$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)} = x_k - \frac{f(x_k)\, f'(x_k)}{(f'(x_k))^2 - f(x_k)\, f''(x_k)}$$
is the modified Newton's method.


Example 2.20
$f(x) = x^4 - 4x^2 + 4 = (x - \sqrt{2})^2 (x + \sqrt{2})^2.$
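For the function of Example 2.20, which has double zeros at $\pm\sqrt{2}$, the difference between Newton's method and the modified Newton's method can be observed numerically. The sketch below is illustrative: the starting point 1.5, the tolerance, the step-size stopping test, and all function names are assumptions, and $f$ is coded in the factored form $(x^2 - 2)^2$ to limit cancellation.

```python
import math


def f(x):
    return (x * x - 2.0) ** 2             # equals x^4 - 4x^2 + 4


def df(x):
    return 4.0 * x * (x * x - 2.0)        # f'(x)


def d2f(x):
    return 12.0 * x * x - 8.0             # f''(x)


def newton(x, tol=1e-10, max_iter=200):
    """Newton's method; returns (approximation, steps taken)."""
    for k in range(1, max_iter + 1):
        d = df(x)
        if d == 0.0:
            return x, k - 1
        step = f(x) / d
        x -= step
        if abs(step) <= tol:
            return x, k
    raise RuntimeError("Newton's method did not converge")


def modified_newton(x, tol=1e-10, max_iter=200):
    """Newton's method applied to g = f / f'; returns (approximation, steps)."""
    for k in range(1, max_iter + 1):
        fx, d1, d2 = f(x), df(x), d2f(x)
        denom = d1 * d1 - fx * d2
        if denom == 0.0:
            return x, k - 1
        step = fx * d1 / denom
        x -= step
        if abs(step) <= tol:
            return x, k
    raise RuntimeError("modified Newton's method did not converge")


x_newton, n_newton = newton(1.5)
x_modified, n_modified = modified_newton(1.5)
```

From the same starting point, the modified iteration should need far fewer steps, since Newton's method converges only linearly at a double zero (see Exercise 22).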


REMARK 2.28 Some of the advantages and disadvantages of the root- 
finding methods considered in this chapter appear in Table 2.2 (on page 84). 


2.10 Exercises 


1. Consider the method of bisection applied to $f(x) = \arctan(x)$, with
initial interval $\boldsymbol{x} = [-4.9, 5.1]$.

(a) Are the hypotheses under which the method of bisection converges
valid? If so, then how many iterations would it take to obtain the
solution to within an absolute error of $10^{-2}$?

(b) Apply Algorithm 2.1 with pencil and paper, until $k = 5$, arranging
your computations carefully so you gain some intuition into the
process.


2. Let $f$ and $\boldsymbol{x}$ be as in Problem 1.


(a) Program Algorithm 2.1 in your chosen programming language.
Print $a_k$, $b_k$, $f(a_k)$, $f(b_k)$, and $f(x_k)$ for each step, so you can
see what is happening.

(b) Try to solve $f(x) = 0$ with $\epsilon = 10^{-2}$, $\epsilon = 10^{-4}$, $\epsilon = 10^{-8}$, $\epsilon = 10^{-16}$,
$\epsilon = 10^{-32}$, $\epsilon = 10^{-64}$, and $\epsilon = 10^{-128}$.

i. For each €, compute the k at which the algorithm should stop. 
ii. What behavior do you actually observe in the algorithm? Can 
you explain this behavior? 
3. Repeat Problem 2, but with $f(x) = x^2 - 2$ and initial interval $\boldsymbol{x} = [1, 2]$.


4. Use the program for the bisection method in Problem 2 to find an ap- 
proximation to 10007 which is correct to within 107°. 


5. Suppose that $f \in C[a, b]$ has a unique zero $x^* \in [a, b]$, $f(a) < 0$ and
$f(b) > 0$. Define the two sequences $\{x_k\}_{k=0}^{\infty}$ and $\{y_k\}_{k=0}^{\infty}$ by $x_0 = a$ and
$y_0 = b$ and for $k = 1, 2, 3, \ldots$ as follows.

(i) If $f\left(\dfrac{x_{k-1} + y_{k-1}}{2}\right) < 0$ then $x_k = \dfrac{x_{k-1} + y_{k-1}}{2}$ and $y_k = y_{k-1}$;

(ii) if $f\left(\dfrac{x_{k-1} + y_{k-1}}{2}\right) \ge 0$ then $x_k = x_{k-1}$ and $y_k = \dfrac{x_{k-1} + y_{k-1}}{2}$.

Prove that
(a) $x^* \in [x_k, y_k]$, $f(x_k) < 0$ and $f(y_k) \ge 0$ for $k = 0, 1, 2, \ldots$.

(b) $|x_k - x^*| \le \dfrac{b - a}{2^k}$ for $k = 0, 1, 2, \ldots$.


6. Consider $g(x) = x - \arctan(x)$.

(a) Perform 10 iterations of the fixed point method $x_{k+1} = g(x_k)$,
starting with $x = 5$, $x = -5$, $x = 1$, $x = -1$, and $x = 0.1$.

(b) Explain the behavior you observe in terms of the theory in this
section.
7. $g(x) = \dfrac{x}{2} + \dfrac{a}{2x}$ has $z = \sqrt{a}$ as a fixed point. Suppose we use fixed point
iteration, with a starting point $x_0 > \sqrt{a}$.

(a) Use the theory from this section to determine a starting point $x_0$,
such that

$$|x_{k+1} - \sqrt{a}| \le \frac{1}{2}\,|x_k - \sqrt{a}|,$$
for $k \ge 1$.

(b) Perform a few iterations of the fixed point method on this problem
(either by hand or with a short computer program).

i. What is the convergence rate you observe? (That is, is it linear
or quadratic? If it is linear then what is the constant by which
the error is approximately multiplied on each iteration?)

ii. Explain the behavior you observe in terms of the theory of this
section.


8. To find a root of $f(x) = 0$ by iteration, rewrite the equation as

$$x = x + c f(x) = g(x)$$

for some constant $c \ne 0$. If $\alpha$ is a root of $f(x)$, if $f'(x)$ is continuous
near $x = \alpha$, and if $f'(\alpha) \ne 0$, how should $c$ be chosen to ensure that the
sequence $x_{k+1} = g(x_k)$ converges to $\alpha$, for $x_0$ sufficiently close to $\alpha$?


9. Let $\alpha$ be a root of $f(x)$ and define an iteration formula by

$$x_{k+1} = z_{k+1} - \frac{f(z_{k+1})}{f'(x_k)}, \qquad z_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}.$$

Show that the order of convergence of $\{x_k\}$ to $\alpha$ is at least 3.
10. It is desired to find the positive real root of the equation x? +a2?—1=0. 


(a) Find an interval $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and a suitable fixed point iteration function
$g(x)$ to accomplish this. Verify all conditions of the contraction
mapping theorem.


(b) Find the minimum number of iterations $n$ needed so that the absolute
error in the $n$-th approximation to the root is correct to $10^{-4}$.
Also, use the fixed-point iteration method (with the $g$ you determined
in part (a)) to determine this positive real root accurate to
within $10^{-4}$.


11. Find an approximation to 10002 correct to within 1075 using the fixed 
point iteration method. 


12. Consider the sequence defined by $x_{k+1} = b + \epsilon h^2(x_k)$ for $k = 0, 1, 2,
\ldots$, where $x_0 \in \mathbb{R}$ is given and $b, \epsilon \in \mathbb{R}$. Assume that

(a) $\exists\, M > 0$ such that $|h(x)| \le M$ for all $x \in \mathbb{R}$, and
(b) $\exists\, L > 0$ such that $|h(v) - h(w)| \le L|v - w|$ for all $v, w \in \mathbb{R}$.

Prove that if $2ML|\epsilon| < 1$, there is a unique $z \in \mathbb{R}$ such that $z =
b + \epsilon h^2(z)$ and the sequence $\{x_k\}_{k=0}^{\infty}$ converges to this value $z$.


13. Consider the sequence $\{x_k\}_{k=0}^{\infty}$ defined by

$$x_{k+1} = g(x_k) = \gamma f(x_k) + h(x_k)$$

for $k = 0, 1, 2, \ldots$, where $\gamma > 0$, $x_0 \in \mathbb{R}$, $f, h : \mathbb{R} \to \mathbb{R}$, and $|h'(x)| < 3$
for all $x \in \mathbb{R}$. Assume that $f$ is Lipschitz with Lipschitz constant $L$.
Find a condition on $\gamma$ that will guarantee convergence of the sequence
$\{x_k\}_{k=0}^{\infty}$ to the unique value $z \in \mathbb{R}$ such that $z = g(z)$.


14. Let
lia 2 O
=5|,-2?--ar+4
g(x) aoe ge +
and $G = [0, 2]$. Use the contraction mapping theorem to prove that if
$x_0 \in G$, then the sequence defined by $x_{k+1} = g(x_k)$, $k = 0, 1, 2, \ldots$,
converges to the unique fixed point $z \in G$.


15. Consider the function $g(x) = e^{-x}$.

(a) Show that $g(x)$ has a unique fixed point $z \in (-\infty, \infty)$.
(b) Prove that $g$ is a contraction on $[\ln 1.1, \ln 3]$.
(c) Prove that $g : [\ln 1.1, \ln 3] \to [\ln 1.1, \ln 3]$.
(d) Prove that $x_{k+1} = g(x_k)$ converges to the unique fixed point $z \in
(-\infty, \infty)$ for any $x_0 \in (-\infty, \infty)$.


16. Assume that $f \in C^2(-\infty, \infty)$, $f(z) = 0$, and $S''(z) \ne 0$, where $S$ is as in
Remark 2.20 on page 68. Show that Steffensen's method (Formula 2.28
on page 68) is of second order. That is, show that Steffensen's method
exhibits quadratic convergence.


17. Consider the fixed point iteration method $x_{k+1} = g(x_k)$, $k = 0, 1, \ldots$ for
solving the nonlinear equation $f(x) = 0$. Consider choosing an iteration
function of the form

$$g(x) = x - a f(x) - b (f(x))^2 - c (f(x))^3,$$

where $a$, $b$, and $c$ are parameters to be determined. Find expressions for
the parameters $a$, $b$, and $c$ such that the iteration method is of fourth
order.


18. Consider the iterative procedure

$$y_{j+1}^{(k+1)} = y_j + \frac{h}{2}\left[f(y_j) + f\bigl(y_{j+1}^{(k)}\bigr)\right], \quad\text{for } k = 0, 1, 2, \ldots$$

where $y_j \in \mathbb{R}$ is given, $f : \mathbb{R} \to \mathbb{R}$, and $y_{j+1}^{(0)} = y_j$. Suppose that $f$ is
Lipschitz on $\mathbb{R}$ with Lipschitz constant $L$. Prove that if $\frac{hL}{2} < 1$, then
$y_{j+1}^{(k)} \to y_{j+1}$ as $k \to \infty$, where $y_{j+1}$ satisfies

$$y_{j+1} = y_j + \frac{h}{2}\left[f(y_j) + f(y_{j+1})\right].$$


19. Assume that the equation $x^2 + bx + c = 0$ has two real roots $\alpha$ and $\beta$
with $|\alpha| < |\beta|$. Show that the iteration method

$$x_{k+1} = \frac{-c}{x_k + b}$$

is convergent to $\alpha$ if $x_0$ is sufficiently close to $\alpha$.
20. Consider $f(x) = \arctan(x)$. This function has a unique zero $z = 0$.


(a) Find a radius $\rho$ as in Theorem 2.5 (on page 50) such that the
conclusions of Theorem 2.5 are true with $K_\rho(0) = [-\rho, \rho]$.

(b) Use a digital computer with double precision arithmetic to do iterations
of Newton's method, starting with $x_0$ = 0.5, 1.0, 1.3, 1.4,
1.35, 1.375, 1.3875, 1.39375, 1.390625, 1.3921875. Iterate until one
of the following occurs: 

° |f(x)| < 107, 
• an operation exception occurs, or
• 20 iterations are completed.


(i) Describe the behavior you observe.
(ii) Explain the behavior you observe in terms of the graph of $f$.
(iii) Evidently, there is a point $p$ such that, if $x_0 > p$, then Newton's
method diverges, and if $x_0 < p$, then Newton's method
converges.
(α) What would happen if $x_0 = p$ exactly? Illustrate what
would happen on a graph of $f$.

(β) Do you think we could choose $x_0 = p$ exactly in practice?

21. Let $f(x) = x^2 - a$.

(a) Write down and simplify the Newton's method iteration equation
for $f(x) = 0$.

(b) Assume $a > 1$. For $x_0 = a$, write down the constants $M$ and $m$ as
in Theorem 2.5.

(c) For $a = 2$, form a table of 15 iterations of Newton's method, starting
with $x_0 = 2$, $x_0 = 4$, $x_0 = 8$, $x_0 = 16$, $x_0 = 32$, and $x_0 = 64$.

(d) Explain your results in terms of the shape of the graph of $f$ and in
terms of the convergence theory in this section.

(e) Compare your analysis here to the analysis you did for Problem 7
in this set and Example 2.9 on page 47.


22. Suppose $z$ is a double root for $f$, i.e., $f(z) = f'(z) = 0 \ne f''(z)$. Show
that if $f''$ is continuous and if Newton's method converges then it converges
linearly in this case. In particular, $e_{n+1} \approx 0.5\, e_n$.


23. Let $f : [a, b] \to \mathbb{R}$ be a $C^2$ function satisfying $f'(x) \ne 0$ for $x \in [a, b]$.
Let $\{x_k\}_{k=0}^{\infty}$ be a Newton iteration sequence for solving $f(x) = 0$, i.e.,
$x_k$ satisfies the following equation:

$$x_k = x_{k-1} - \frac{f(x_{k-1})}{f'(x_{k-1})}.$$

Assume $x_k \in (a, b)$ and $\lim_{k \to \infty} x_k = r$. Show that $f(r) = 0$ and that, if
$f'(x) \ne 0$ on $[a, b]$, then

$$|x_k - r| \le \max_{x \in [a, b]} \frac{1}{|f'(x)|}\; |f(x_k)|.$$


24. By writing $g'(x) = g'(z) + g''(c_x)(x - z)$, use the line of reasoning in
Remark 2.11 (page 53) to show that Newton's method is quadratically
convergent.


25. Using Newton's method, establish the iterative scheme
$$x_{k+1} = x_k (2 - R x_k)$$
to calculate the reciprocal of a number $R$. Also, show that this iterative
scheme yields quadratic convergence.


26. Describe Newton's method to compute the value of $x$ that satisfies the
equation $\int_0^x e^{t^2}\,dt = 1$.


27. Consider Newton's method $x_{k+1} = x_k - \dfrac{f(x_k)}{f'(x_k)}$. Suppose that:

(i) $f'(x)$ is continuous and $f'(x) > 0$ for $x \in \mathbb{R}$.
(ii) $f(y) - f(x) \ge f'(x)(y - x)$ for $x, y \in \mathbb{R}$.
(iii) $f(0) < 0$, $x_0 > 0$ and $f(x_0) > 0$.

Then,

(a) Prove, by induction, that $x_0 \ge x_1 \ge x_2 \ge \cdots \ge x_{n+1} \ge \cdots > 0$.

(b) Prove that $x_n \to z$ as $n \to \infty$, where $z$ satisfies $f(z) = 0$.


28. Hint: Use of Mathematica's interval arithmetic or the free INTLAB toolbox
for MATLAB is recommended. There are also freely available Fortran
90 and C++ packages for interval arithmetic.


(a) Let $f$ be as in Problem 20 of this set. Experiment with the interval
Newton method for this problem, and with various intervals that
contain zero. Explain what you have found.


(b) Use the interval Newton method to prove that there exists a unique
solution to $f(x) = 0$ for $x \in [-1, 0]$, where $f(x) = x + e^x$.


(c) Iterate the interval Newton method to find as narrow bounds as
possible on the solution proven to exist in part 28b.


(d) For comparison, attempt to use Theorem 2.5 (on page 50) to prove
existence and uniqueness of a root of $f$ within $[-1, 0]$.

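29. An alternate form of interval Newton method is the Krawczyk method.
The Krawczyk method is based on finding a fixed point of $g$, where

$$g(x) = x - f(x)/f'(\check{x}), \text{ where } \check{x} \text{ is some point in } \boldsymbol{x}.$$

The Krawczyk method is derived by using the mean value extension
(See Problem 26 in Chapter 1.):

$$g(x) \in g(\check{x}) + g'(\boldsymbol{x})(\boldsymbol{x} - \check{x}), \text{ provided } x \in \boldsymbol{x} \text{ and } \check{x} \in \boldsymbol{x}.$$

Suppose:

(a) $g(\check{x}) + g'(\boldsymbol{x})(\boldsymbol{x} - \check{x}) = \check{x} - \dfrac{f(\check{x})}{f'(\check{x})} + \left(1 - \dfrac{f'(\boldsymbol{x})}{f'(\check{x})}\right)(\boldsymbol{x} - \check{x}) \subseteq \boldsymbol{x}$;

(b) $\mathrm{mag}\left(1 - \dfrac{f'(\boldsymbol{x})}{f'(\check{x})}\right) = \max_{x \in \boldsymbol{x}} \left|1 - \dfrac{f'(x)}{f'(\check{x})}\right| < 1$.

Prove that there is a unique solution of $f(x) = 0$ in $\boldsymbol{x}$. (See page 156
for the formal definition of $\mathrm{mag}(\boldsymbol{x})$.)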


30. Show how Equation (2.19) follows from the derivation preceding it. 


31. Repeat Exercise 20b, page 79, but with the secant method instead of 
Newton’s method. (Use pairs of starting points {0.5,1.0}, {1.0,1.3}, 
etc.) 


32. Verify that, in Equation (2.20) on page 63, $p(x_k) = f(x_k)$, $p(x_{k-1}) =
f(x_{k-1})$, and $p(x_{k-2}) = f(x_{k-2})$.


33. Repeat Exercise 20, page 79, but with Müller's method instead of Newton's
method. (Use triplets of starting points {0.5, 1.0, 1.3}, {1.0, 1.3, 1.4},
etc.)

34. Repeat Exercise 20b, page 79, but with Steffensen's method instead of
Newton's method.


35. Do three steps of Newton's method, using complex arithmetic, for the
function $f(z) = z^2 + 1$, with starting guess $z = 0.2 + 0.7i$. Although
you may use a computer program, you should show intermediate results,
including $z_k$, $f(z_k)$, and $f'(z_k)$. (Note: Newton's method with
complex arithmetic can be viewed as a multivariate Newton method in
two variables; see Exercise 9 on page 483, in Section 8.3.)


36. Example 2.19 deals with the Wilkinson? polynomial of degree 7, where 
the Wilkinson polynomial of degree n is defined to be 


These polynomials have roots that are notoriously sensitive to the co- 
efficients of the polynomial in power form, so they are interesting test 
problems. 


(a) Find ap through a7 exactly, as integers (by using, say, the “Expand” 
function in Mathematica, or something similar in your favorite 
symbolic manipulation program, or else by looking them up.) 


(b) Program Algorithm 2.3 (on page 72). Here, you will need to make 
some choices, such as the method you will use in Step 1 and how 
you will choose the starting points.* 


(c) Test your program from part 36b on several simple examples, such 
as p(x) = x? + 3x + 2 or p(x) = x° — 1, to make sure it is working 
correctly. 


(d) Using the coefficients from part 36a, try your program on the 
Wilkinson polynomial. Do you get the same thing as Example 2.19? 


(e) Try your program for the Wilkinson polynomial of degree n = 
20 (after repeating part 36a for n = 20.) Do you get reasonable 
answers? 


3 J. H. Wilkinson was a famous numerical analyst in the middle of the twentieth century, at
the beginning of the era of modern digital computers. He invented the concept of “backward
error analysis,” that is, the technique of showing that roundoff errors produce an answer
that is the exact answer to a nearby problem. He is also a father of modern numerical linear
algebra; one of his works is the monograph The Algebraic Eigenvalue Problem [101]. The
Society for Industrial and Applied Mathematics awards the “Wilkinson Prize,” in honor
of J. H. Wilkinson, for outstanding numerical software packages.

4 One way to choose the starting points is to compute R as in “Result 1” on page 70, and
choose the initial guess in the disk.

(f) Do a stability analysis similar to Example 2.19, for the n = 20
case. Is your analysis consistent with your results from part 36e?
(Note that, although the coefficients are integers, some of them 
have more digits than can be represented internally in an IEEE 
double precision number, so that the polynomial that the computer 
is actually storing in its memory is a perturbation of the Wilkinson 
polynomial. Estimate the size of that perturbation.) 
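
The remark about double precision in part (f) can be explored with exact integer arithmetic. The following is a minimal sketch, assuming the usual definition of the Wilkinson polynomial as the product of (x - i) for i = 1, ..., n; the function names `wilkinson_coefficients` and `not_exact_in_double` are illustrative and do not come from the text.

```python
def wilkinson_coefficients(n):
    """Return [a_0, ..., a_n] with prod_{i=1}^n (x - i) = sum_k a_k x^k (exact integers)."""
    coeffs = [1]  # the constant polynomial 1, lowest power first
    for root in range(1, n + 1):
        # Multiply the current polynomial p(x) by (x - root).
        shifted = [0] + coeffs                      # coefficients of x * p(x)
        scaled = [root * c for c in coeffs] + [0]   # coefficients of root * p(x)
        coeffs = [s - t for s, t in zip(shifted, scaled)]
    return coeffs


def not_exact_in_double(coeffs):
    """Indices k whose coefficient changes when rounded to an IEEE double."""
    return [k for k, c in enumerate(coeffs) if int(float(c)) != c]


a7 = wilkinson_coefficients(7)
a20 = wilkinson_coefficients(20)
inexact = not_exact_in_double(a20)
print("degree 7 coefficients:", a7)
print("degree 20 coefficients not exactly representable:", inexact)
```

Python integers are exact, so the comparison `int(float(c)) != c` isolates the rounding that occurs when a coefficient is stored as a double. The relative size of that rounding is one way to start the perturbation estimate asked for in part (f).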


37. Write a program that combines Newton’s, Horner’s, and Müller’s methods
to obtain all the roots of the following polynomials.
(I) P(x) =2a*-1. 
(II) $P(x) = x^5 - 5x^4 - 8x^3 + 40x^2 - 9x + 45$.
(III) P(x) = 112° — 5a4* — 8x3 + 40x? — 9x + 45. 


Use the following steps in your implementation. 


• Repeat the following until all the approximate roots of P(x) are
determined.
(a) Employ Müller’s method to determine a root z of P(x).
(b) Deflate P(x) = (x − z)Q(x) using Horner’s algorithm. (That
is, use Algorithm 2.3 on page 72.)
(c) Replace P(x) by Q(x).
(d) Go to Step (a).
• Using each approximate root obtained as an initial guess, apply
Newton’s method to obtain better approximations for the roots of
the original polynomial.


Explain your observation for the answers obtained for (II) and (III). 
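
The deflation step (b) can be sketched with Horner's rule (synthetic division). This is only an illustration under stated assumptions: Algorithm 2.3 is not reproduced in this section, so the function name `horner_deflate` and the coefficient ordering (highest power first) are choices made here, and the test polynomial is the reading of (II) as x^5 - 5x^4 - 8x^3 + 40x^2 - 9x + 45, which has the exact root 5.

```python
def horner_deflate(coeffs, z):
    """Divide P by (x - z) using Horner's rule.

    coeffs lists P from the highest power down: [a_n, ..., a_1, a_0].
    Returns (q, r), where q lists the quotient Q the same way and r = P(z).
    """
    b = [coeffs[0]]
    for a in coeffs[1:]:
        # Each new entry is the next coefficient plus z times the previous entry.
        b.append(a + z * b[-1])
    return b[:-1], b[-1]


p = [1, -5, -8, 40, -9, 45]
q, r = horner_deflate(p, 5)
print("quotient:", q, "remainder:", r)
```

With an approximate root z the remainder is small but not zero, and the error in z contaminates the coefficients of Q. That is why the final bullet polishes every root with Newton's method on the original polynomial.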


38. Does your program from Problem 37 work well for the Wilkinson polynomial
(Example 2.19 on page 73) of degree 7? What about degree 10?
What about degree 20?

TABLE 2.2: A summary of the methods considered in Chapter 2.

Bisection
  Advantages: simple and reliable; converges provided that f(a)f(b) < 0 and f is continuous
  Disadvantages: not rapidly convergent

Fixed-point
  Advantages: simple; is a useful theoretical tool for analyzing other methods
  Disadvantages: if g'(z) ≠ 0 then linearly convergent. Also, may be difficult to find a good g. Generally, requires g: G → G and g a contraction.
  Other comments: is connected to other methods and theory throughout numerical analysis, analysis, topology, and differential equations

Newton’s
  Advantages: generally quadratically convergent
  Disadvantages: f'(x) is required; may be only convergent in a small interval about z
  Other comments: special fixed-point scheme with g(x) = x − f(x)/f'(x). (Note that g'(z) = 0.)

Interval Newton
  Advantages: often quadratically convergent; provides mathematically rigorous proof of existence and uniqueness, and provides mathematically rigorous bounds on the error
  Disadvantages: requires interval arithmetic; convergence radius may be smaller than the point Newton method
  Other comments: when combined with intersection of the previous iterate, convergence may occur when the point Newton method diverges.

Secant
  Advantages: generally converges superlinearly; f'(x) not required
  Disadvantages: may only have a small interval of convergence
  Other comments: not a fixed-point scheme

Müller’s
  Advantages: often converges superlinearly; useful for finding complex roots
  Disadvantages: may only have a small interval of convergence
  Other comments: not a fixed-point scheme

Modified Newton’s
  Advantages: quadratically convergent for multiple roots
  Disadvantages: f'(x) and f''(x) are required
  Other comments: special fixed-point scheme (Newton’s method applied to g(x) = f(x)/f'(x))

Aitken’s
  Advantages: simple method that speeds up linearly convergent sequences
  Disadvantages: sequence must be linearly convergent


Chapter 3 


Numerical Linear Algebra 


Numerical linear algebra is primarily concerned with two important subjects: 
1. efficient solution of linear systems, and 
2. computation of eigenvalues and eigenvectors of matrices. 


Numerical solution of nonlinear systems, partial differential equations, integral
equations, etc., generally involves the solution of linear systems. Eigenvalue-eigenvector
computation occurs in physical and biological applications. Also,
for example, solution of systems of differential equations can involve
eigenvalue-eigenvector computation.

This chapter deals primarily with efficient solution of linear systems, while
eigenvalues and eigenvectors are treated in Chapter 5. Some good reference
texts for numerical linear algebra are [34, 37, 40, 68, 78, 85, 97, 103].


3.1 Basic Results from Linear Algebra 


Here, we give a brief review of matrix algebra.¹


DEFINITION 3.1 Let $\mathbb{R}^n$ ($\mathbb{C}^n$) be the n-dimensional space of n-tuples
of real (complex) numbers, i.e., if $x \in \mathbb{C}^n$ then $x = [x_1, x_2, \ldots, x_n]^T$ where
$x_i \in \mathbb{C}$ for $i = 1, 2, \ldots, n$.


DEFINITION 3.2 $x^T = [x_1, x_2, \ldots, x_n]$, $x^H = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n]$, where
T and H refer to transpose and conjugate transpose. (If $x = a + ib$, then
$\bar{x} = a - ib$.)

REMARK 3.1 $x^H x$ is a nonnegative number whereas $x x^H$ is an n × n
matrix.


1 This is not meant to be a self-contained introduction, but introduces our notation, and
serves as a quick reference. It is assumed that the reader already knows the basic concepts.

DEFINITION 3.3 The set of real (complex) n × m matrices is denoted
by $L(\mathbb{R}^m, \mathbb{R}^n)$ ($L(\mathbb{C}^m, \mathbb{C}^n)$) or $L(\mathbb{R}^n)$ ($L(\mathbb{C}^n)$) if m = n. The elements of
matrix A will be written $a_{ij}$, and we sometimes write $A = (a_{ij})$. We also
may sometimes say $A \in \mathbb{R}^{m \times n}$ to mean that A is a real m by n matrix, and
$A \in \mathbb{C}^{m \times n}$ to mean that A is a complex m by n matrix.


DEFINITION 3.4 If $A = (a_{ij})$, then $A^T = (a_{ji})$ and $A^H = (\bar{a}_{ji})$ denote
the transpose and conjugate transpose of A, respectively.


REMARK 3.2 $(AB)^H = B^H A^H$, $(AB)^T = B^T A^T$, and if A is m × n
then $A^T$ or $A^H$ is n × m.


DEFINITION 3.5 If A is an m × n matrix and B is an n × p matrix,
then C = AB where

\[ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \]

for $i = 1, \ldots, m$, $j = 1, \ldots, p$. Thus, C is an m × p matrix.


DEFINITION 3.6 If A is an n × n matrix ($A \in L(\mathbb{C}^n)$), then a scalar $\lambda$
and a nonzero x are an eigenvalue and eigenvector of A if $Ax = \lambda x$.


DEFINITION 3.7 Suppose $A \in L(\mathbb{R}^n)$ or $A \in L(\mathbb{C}^n)$. $A^{-1}$ is the inverse
of A if $A^{-1}A = AA^{-1} = I$, where I is the n by n identity matrix, consisting of
1’s on the diagonal and 0’s in all off-diagonal elements. If A has an inverse,
then A is said to be nonsingular or invertible.


DEFINITION 3.8 The rank of a matrix A, rank(A), is the maximum
number of linearly independent rows it possesses. It can be shown that this is
the same as the maximum number of linearly independent columns. If A is an
m by n matrix and rank(A) = min{m, n}, then A is said to be of full rank.


For example, if m ≤ n and the rows of A are linearly independent, then A is
of full rank.


The following theorem deals with rank, nonsingularity, and solutions to 
systems of equations. 


THEOREM 3.1
Let A be an n × n matrix ($A \in L(\mathbb{C}^n)$). Then the following are equivalent:

1. A is nonsingular.

2. $\det(A) \ne 0$, where det(A) is the determinant of the matrix A.

3. The linear system Ax = 0 has only the solution x = 0.

4. For any $b \in \mathbb{C}^n$, the linear system Ax = b has a unique solution.

5. The columns (and rows) of A are linearly independent, that is, if $a_1$,
$a_2$, ..., $a_n$ are the columns of A and $\sum_{i=1}^{n} \beta_i a_i = 0$, then $\beta_i = 0$ for
$i = 1, 2, \ldots, n$. (That is, rank(A) = n.)


REMARK 3.3 By Definition 3.6 and Theorem 3.1, $\lambda$ is an eigenvalue of
A if and only if $\det(A - \lambda I) = 0$. The equation $\det(A - \lambda I) = 0$ is called the
characteristic equation of A.


DEFINITION 3.9 $\rho(A) = \max_{1 \le i \le n} |\lambda_i|$, where $\{\lambda_i\}_{i=1}^{n}$ is the set of
eigenvalues of A, is called the spectral radius of A.


DEFINITION 3.10 If $A^T = A$, then A is said to be symmetric. If
$A^H = A$, then A is said to be Hermitian.


Example 3.1

If $A = \begin{pmatrix} 1 & 2-i \\ 2+i & 3 \end{pmatrix}$, then $A^H = \begin{pmatrix} 1 & 2-i \\ 2+i & 3 \end{pmatrix}$, so A is Hermitian.


DEFINITION 3.11 For $A \in L(\mathbb{R}^n)$, if $A^T = A$ and $x^T A x > 0$ for
any $x \in \mathbb{R}^n$ except x = 0, then A is said to be symmetric positive definite.
For $A \in L(\mathbb{C}^n)$, if $A^H = A$ and $x^H A x > 0$ for $x \in \mathbb{C}^n$, $x \ne 0$, then A is
said to be Hermitian positive definite.² Similarly, if $x^T A x \ge 0$ (for a real
matrix A) or $x^H A x \ge 0$ (for a complex matrix A) for every $x \ne 0$, we say
that A is symmetric positive semi-definite or Hermitian positive semi-definite,
respectively.


Example 3.2


If $A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}$, then $A^T = A$, so A is symmetric.


2 A matrix need not be symmetric or Hermitian to be positive definite or positive
semi-definite, although that is the usual context. That is, A is positive definite provided $x^T A x > 0$
for every $x \ne 0$, and positive semi-definite provided $x^T A x \ge 0$ for every $x \ne 0$.


88 Classical and Modern Numerical Analysis 


Also, $x^T A x = 4x_1^2 + 2x_1 x_2 + 3x_2^2 = 3x_1^2 + (x_1 + x_2)^2 + 2x_2^2 > 0$ for $x \ne 0$.
Thus, A is symmetric positive definite.


PROPOSITION 3.1 
If A is symmetric positive definite or Hermitian positive definite, then its 
eigenvalues are real and positive. 


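PROOF Suppose that $Ax = \lambda x$. Then $x^H A x = \lambda x^H x$. Also, $(x^H A x)^H =
(\lambda x^H x)^H$, which yields $x^H A x = \bar{\lambda} x^H x$. Thus, $\lambda x^H x = \bar{\lambda} x^H x$. Hence $\lambda = \bar{\lambda}$,
so $\lambda$ is real. Also, $\lambda = x^H A x / x^H x > 0$.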


Now consider a linear system of equations Ax = b, where A is n x n, and 
b,x € R”. 


DEFINITION 3.12 = Elementary row operations on a system of linear 
equations are of the following three types: 


1. interchanging two equations, 
2. multiplying an equation by a nonzero number, 


3. adding to one equation a scalar multiple of another equation. 


THEOREM 3.2 
If system Bx = d is obtained from system Ax = b by a finite sequence of
elementary operations, then the two systems have the same solutions. 


A proof of Theorem 3.2 can be found in elementary texts on linear alge- 
bra and can be done, for example, with Theorem 3.1 and using elementary 
properties of determinants. 


3.2. Normed Linear Spaces 


We will use these concepts both in this chapter and in subsequent chapters. 


DEFINITION 3.13 Let V be a vector space (recall that a vector space
is closed under addition and scalar multiplication) over the field of real or
complex numbers. V is called a normed vector space if to each $u \in V$ a
nonnegative number $\|u\|$, called the norm of u, is assigned with the following
properties:

1. $\|u\| \ge 0$.

2. $\|u\| = 0$ if and only if u = 0.

3. $\|\lambda u\| = |\lambda| \, \|u\|$ for $\lambda \in \mathbb{R}$ (or $\lambda \in \mathbb{C}$ if V is a complex vector space).

4. $\|u + v\| \le \|u\| + \|v\|$ (triangle inequality).


DEFINITION 3.14 W is a subspace of a real vector space V if $u \in W$,
$v \in W$ imply that $\alpha u + \beta v \in W$ for all $\alpha, \beta \in \mathbb{R}$.


DEFINITION 3.15 Let V be a vector space. Then $u_1, u_2, \ldots, u_n \in V$
are linearly independent if $\sum_{i=1}^{n} \alpha_i u_i = 0$ implies that $\alpha_1 = \alpha_2 = \cdots = \alpha_n = 0$.


Example 3.3 

Let V = C[a, b], the space of continuous functions on an interval [a, b]. Then
u, = 1, ug = @ are linearly independent, while u; = 1, wz= 2, u3 = 2—-2 are 
linearly dependent. 


DEFINITION 3.16 Let $u_1, u_2, \ldots, u_n \in V$. The set of all linear
combinations of $u_1, u_2, \ldots, u_n$ is called the span of $u_1, u_2, \ldots, u_n$.


REMARK 3.4 Let $W = \mathrm{span}(u_1, u_2, \ldots, u_n)$. It is easy to show that W
is a subspace of V. ($w \in W$ has the form $w = \sum_{i=1}^{n} c_i u_i$.)


DEFINITION 3.17 If $V = \mathrm{span}(u_1, u_2, \ldots, u_n)$ and $u_1, u_2, \ldots, u_n$ are
linearly independent, then $u_1, u_2, \ldots, u_n$ form a basis for V.


REMARK 3.5 Suppose that V has a basis of n elements. Then, ev- 
ery basis of V has n elements. Any collection of n + 1 elements is linearly 
dependent. 


DEFINITION 3.18 — If a vector space V has a basis with a finite number 
n of elements, then n is called the dimension of the vector space. 


Example 3.4 
Let $P^2$ represent the set of polynomials of degree 2 or less. Then $P^2$ is a
subspace of V = C[a, b]. $P^2 = \mathrm{span}(1, x, x^2)$, and, since 1, x, and $x^2$ are
linearly independent, the dimension of $P^2$ is 3.

Consider $V = \mathbb{C}^n$, the vector space of n-tuples of complex numbers. Note
that $x \in \mathbb{C}^n$ has the form $x = (x_1, x_2, \ldots, x_n)^T$. Also,

\[ x + y = (x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n)^T \]

and

\[ \lambda x = (\lambda x_1, \lambda x_2, \ldots, \lambda x_n)^T . \]

Important norms on $\mathbb{C}^n$ are:

(a) $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$: the $\ell_\infty$ or max norm (for $z = a + ib$, $|z| = \sqrt{a^2 + b^2} = \sqrt{\bar{z} z}$.)

(b) $\|x\|_1 = \sum_{i=1}^{n} |x_i|$: the $\ell_1$ norm

(c) $\|x\|_2 = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{1/2}$: the $\ell_2$ norm (Euclidean norm)

(d) $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$: the $\ell_p$ norm


REMARK 3.6 (b) and (c) are special cases of (d), and it can be shown
that

\[ \|x\|_\infty = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} . \]
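
The norms (a) through (d) and the limit in Remark 3.6 are easy to tabulate. The following is a minimal sketch in plain Python; the helper names `norm_p` and `norm_inf` and the sample vector are assumptions made for illustration.

```python
def norm_p(x, p):
    """l_p norm of a sequence of real or complex numbers, for p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def norm_inf(x):
    """l_infinity (max) norm."""
    return max(abs(xi) for xi in x)


x = [3 + 4j, -1, 2]   # the moduli of the components are 5, 1, 2
values = {p: norm_p(x, p) for p in (1, 2, 8, 64)}
for p, v in values.items():
    print("p =", p, " ||x||_p =", v)
print("||x||_inf =", norm_inf(x))
```

As p grows, the largest component dominates the sum, so the p-norms decrease toward the max norm, consistent with Remark 3.6.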


REMARK 3.7 It is not hard to see that (1), (2), and (3) of Definition 3.13
are satisfied for the above norms. The triangle inequality (4) of Definition 3.13
for (a) and (b) follows from $|x_i + y_i| \le |x_i| + |y_i|$, e.g.,

\[ \|x + y\|_1 = \sum_{i=1}^{n} |x_i + y_i| \le \sum_{i=1}^{n} |x_i| + \sum_{i=1}^{n} |y_i| = \|x\|_1 + \|y\|_1 . \]


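The triangle inequality, or Minkowski’s inequality, is not proved here for
p-norms since we won’t use general p-norms. The triangle inequality for the
Euclidean norm follows from:


THEOREM 3.3
(the Cauchy-Schwarz inequality)

\[ \left| \sum_{i=1}^{n} x_i \bar{y}_i \right| \le \|x\|_2 \, \|y\|_2 . \]

PROOF Let |x| and |y| be vectors with components $|x_j|$ and $|y_j|$. Then
for $\theta \in \mathbb{R}$,

\[ 0 \le \big\| \theta |x| + |y| \big\|_2^2 = \theta^2 \sum_{j=1}^{n} |x_j|^2 + 2\theta \sum_{j=1}^{n} |x_j| |y_j| + \sum_{j=1}^{n} |y_j|^2 . \]

The quadratic polynomial in $\theta$ on the right-hand side does not change sign.
Note that if $p(\theta) = a\theta^2 + b\theta + c \ge 0$ then the discriminant $b^2 - 4ac \le 0$. Thus,

\[ \left( \sum_{j=1}^{n} |x_j| |y_j| \right)^2 \le \sum_{j=1}^{n} |x_j|^2 \sum_{j=1}^{n} |y_j|^2 . \]

But

\[ \left| \sum_{j=1}^{n} x_j \bar{y}_j \right|^2 \le \left( \sum_{j=1}^{n} |x_j| |y_j| \right)^2 \le \sum_{j=1}^{n} |x_j|^2 \sum_{j=1}^{n} |y_j|^2 , \]

that is,

\[ \left| \sum_{j=1}^{n} x_j \bar{y}_j \right| \le \|x\|_2 \, \|y\|_2 . \]

The triangle inequality for $\| \cdot \|_2$ now follows from the Cauchy-Schwarz
inequality as follows:

\[
\begin{aligned}
\|x + y\|_2 &= \left( \sum_{j=1}^{n} (x_j + y_j)(\bar{x}_j + \bar{y}_j) \right)^{1/2} \\
&= \left( \sum_{j=1}^{n} |x_j|^2 + \sum_{j=1}^{n} (x_j \bar{y}_j + \bar{x}_j y_j) + \sum_{j=1}^{n} |y_j|^2 \right)^{1/2} \\
&\le \left( \|x\|_2^2 + 2 \left| \sum_{j=1}^{n} x_j \bar{y}_j \right| + \|y\|_2^2 \right)^{1/2} \\
&\le \left( \|x\|_2^2 + 2 \|x\|_2 \|y\|_2 + \|y\|_2^2 \right)^{1/2} = \|x\|_2 + \|y\|_2 .
\end{aligned}
\]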


An important type of normed space is an inner product space: 


DEFINITION 3.19 An inner product on $\mathbb{C}^n \times \mathbb{C}^n$ is a complex-valued
function, denoted by $(\cdot, \cdot)$, defined on all pairs $x, y \in \mathbb{C}^n$ such that

(a) $(x, x) \ge 0$ and $(x, x) = 0$ if and only if x = 0,

(b) $(x, y) = \overline{(y, x)}$,

(c) $(x, \lambda y + \mu z) = \bar{\lambda} (x, y) + \bar{\mu} (x, z)$ for $\lambda, \mu \in \mathbb{C}$.


REMARK 3.8 $(\lambda x, y) = \overline{(y, \lambda x)} = \lambda (x, y)$.


REMARK 3.9 If $\|x\| = (x, x)^{1/2}$, then $\| \cdot \|$ is a norm. Thus, any inner
product space is also a normed vector space.


Example 3.5
The following define inner products.

1. $(x, y) = \sum_{i=1}^{n} x_i \bar{y}_i$. Clearly, $(x, x) = \|x\|_2^2$. Also, $(x, y) = y^H x$
where $y^H = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_n)$.

2. $(x, y) = \sum_{j=1}^{n} \sum_{i=1}^{n} x_i h_{ji} \bar{y}_j = (Hx, y)_2 = y^H H x$, where $(\cdot,\cdot)_2$ is the inner product of item 1 and H is an n × n
Hermitian positive definite matrix.


We now introduce a concept and notation for describing errors in computa- 
tions involving vectors. 


DEFINITION 3.20 The distance from u to v is defined as $\|u - v\|$.


The following concept is also fundamental to inner product spaces. In 
particular, we will use it heavily in Section 3.3.8 and Section 4.2.2. 


DEFINITION 3.21 Let $(\cdot, \cdot)$ represent an inner product. Two vectors u
and v are called orthogonal with respect to $(\cdot, \cdot)$ provided (u, v) = 0. A set of
vectors $v^{(i)}$ is said to be orthonormal, provided $(v^{(i)}, v^{(j)}) = \delta_{ij}$, where $\delta_{ij}$ is
the Kronecker delta function

\[ \delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases} \]

To analyze iterative techniques involving vectors and matrices, we use: 


DEFINITION 3.22 A sequence of vectors $\{x^k\}_{k=1}^{\infty}$ is said to converge
to a vector $x \in \mathbb{C}^n$ if and only if $\|x^k - x\| \to 0$ as $k \to \infty$ for some norm $\| \cdot \|$.

REMARK 3.10 Definition 3.22 implies that a sequence of vectors $\{x^k\}$
converges to x if and only if³ $x_i^k \to x_i$ as $k \to \infty$ for all i. (We will discuss
this further later.)


Example 3.6 
Let x* = [1*,(5)*, (4)*, (¢)*]". Then x* > x = [1,0,0,0]” in the @..- norm, 
since 


1\* 
t-al= (5) —O0ask— oo. 


DEFINITION 3.23 Two norms $\| \cdot \|_\alpha$ and $\| \cdot \|_\beta$ are called equivalent if
there exist positive constants $c_1$ and $c_2$ such that

\[ c_1 \|x\|_\alpha \le \|x\|_\beta \le c_2 \|x\|_\alpha . \]

Hence, also,

\[ \frac{1}{c_2} \|x\|_\beta \le \|x\|_\alpha \le \frac{1}{c_1} \|x\|_\beta . \]


REMARK 3.11 If the norms $\| \cdot \|_\alpha$ and $\| \cdot \|_\beta$ are equivalent, then $x^k \to x$
in norm $\| \cdot \|_\alpha$ if and only if $x^k \to x$ in norm $\| \cdot \|_\beta$.


THEOREM 3.4 


Any two norms on C” are equivalent. 


REMARK 3.12 By Theorem 3.4, convergence in any norm in C” thus 
implies convergence in any other norm. 


PROOF (of Theorem 3.4) We will prove that any norm is equivalent to
the $\ell_2$ norm. Then any two norms are equivalent. That is, if

\[ c_1 \|x\|_2 \le \|x\|_\alpha \le c_2 \|x\|_2 \]

and

\[ c_1' \|x\|_2 \le \|x\|_\beta \le c_2' \|x\|_2 , \]

then

\[ \frac{c_1'}{c_2} \|x\|_\alpha \le \|x\|_\beta \le \frac{c_2'}{c_1} \|x\|_\alpha . \]


3 Here, we are considering only finite-dimensional vector spaces.

Let $\| \cdot \|$ be any vector norm. Let $e_1, e_2, \ldots, e_n$ be the usual basis vectors for
$\mathbb{C}^n$, that is, $e_j = (\delta_{1j}, \delta_{2j}, \ldots, \delta_{nj})^T$, $j = 1, 2, \ldots, n$, where $\delta_{ij}$ is the
Kronecker delta. Then

\[ x = \sum_{j=1}^{n} x_j e_j . \]

Thus,

\[ \|x\| \le \sum_{j=1}^{n} |x_j| \, \|e_j\| \le \left( \sum_{j=1}^{n} |x_j|^2 \right)^{1/2} \left( \sum_{j=1}^{n} \|e_j\|^2 \right)^{1/2} \]

by the Cauchy-Schwarz inequality. Thus,

\[ \|x\| \le \gamma_1 \|x\|_2 , \quad \text{where } \gamma_1 = \left( \sum_{j=1}^{n} \|e_j\|^2 \right)^{1/2} . \]

Let $h(x) = \|x\|$, i.e., $h : \mathbb{C}^n \to \mathbb{R}$. Notice that

\[ |h(x) - h(y)| = \big| \|x\| - \|y\| \big| \le \|x - y\| \le \gamma_1 \|x - y\|_2 , \]

so h is continuous with respect to $\| \cdot \|_2$.

Let $S = \{ z \in \mathbb{C}^n : \|z\|_2 = 1 \}$, i.e., S is the surface of the unit ball in $\mathbb{C}^n$. The
unit spherical surface S is closed and bounded, and h is a continuous function
on S. By a classical theorem of topology or analysis,⁴ h has a minimum at
some z in S. Let $\gamma_0 = h(z) \le h(y)$ for all $y \in S$. (Notice that $\gamma_0 > 0$
since if $\gamma_0 = 0$ then $h(z) = \|z\| = 0$ implies z = 0, which is impossible since
$\|z\|_2 = 1$.) Now, for $x \ne 0$, $y = x / \|x\|_2 \in S$. Thus, $0 < \gamma_0 \le h(x / \|x\|_2) = \|x\| / \|x\|_2$.
Hence, $\gamma_0 \|x\|_2 \le \|x\|$. Combining this with our earlier result $\|x\| \le \gamma_1 \|x\|_2$
gives $\gamma_0 \|x\|_2 \le \|x\| \le \gamma_1 \|x\|_2$.


REMARK 3.13 Recall that earlier we claimed that $\|x - x^k\| \to 0$ as
$k \to \infty$ if and only if $x_i^k \to x_i$, $1 \le i \le n$. This is now obvious, since there
exist constants $c_1$ and $c_2$ such that $c_1 \|x - x^k\|_\infty \le \|x - x^k\| \le c_2 \|x - x^k\|_\infty$,
keeping in mind that $\|x - x^k\|_\infty = \max_{1 \le i \le n} |x_i - x_i^k|$.


Consider a specific case of Theorem 3.4: 


PROPOSITION 3.2
For each $x \in \mathbb{C}^n$, $\|x\|_\infty \le \|x\|_2 \le \sqrt{n} \, \|x\|_\infty$.


4 The theorem states that a continuous function on a compact set attains its minimum and
its maximum at points of that set.

PROOF Let $x_j$ be the element of x such that

\[ \|x\|_\infty = \max_{1 \le i \le n} |x_i| = |x_j| . \]

Then

\[ \|x\|_\infty^2 = |x_j|^2 \le \sum_{i=1}^{n} |x_i|^2 = \|x\|_2^2 . \]

Also,

\[ \|x\|_2^2 = \sum_{i=1}^{n} |x_i|^2 \le \sum_{i=1}^{n} |x_j|^2 = n |x_j|^2 = n \|x\|_\infty^2 . \]

Thus, $\|x\|_\infty \le \|x\|_2 \le \sqrt{n} \, \|x\|_\infty$.
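
A quick numerical experiment can make Proposition 3.2 concrete. This sketch checks the two inequalities on random complex vectors and on the two vectors for which equality holds; the function name `check_bounds`, the sample size, and the small tolerance for rounding are assumptions made here.

```python
import math
import random


def norm_2(x):
    return math.sqrt(sum(abs(xi) ** 2 for xi in x))


def norm_inf(x):
    return max(abs(xi) for xi in x)


def check_bounds(x, tol=1e-12):
    """True if ||x||_inf <= ||x||_2 <= sqrt(n) ||x||_inf, up to rounding."""
    n = len(x)
    low, mid, high = norm_inf(x), norm_2(x), math.sqrt(n) * norm_inf(x)
    return low <= mid * (1 + tol) and mid <= high * (1 + tol)


random.seed(0)
samples = [
    [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5)]
    for _ in range(1000)
]
all_ok = all(check_bounds(x) for x in samples)
print("bounds hold on all samples:", all_ok)

# Equality cases: a unit basis vector (left bound) and the all-ones vector (right bound).
print(norm_inf([1, 0, 0]), norm_2([1, 0, 0]))
print(norm_2([1, 1, 1, 1]), math.sqrt(4) * norm_inf([1, 1, 1, 1]))
```

The equality cases show that the constants 1 and the square root of n in the proposition cannot be improved.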


We now consider norms for n x n complex matrices. In the following, A 
and B are arbitrary square matrices and \ is a complex number. 


DEFINITION 3.24 A matrix norm is a real-valued function of A, denoted
by $\| \cdot \|$, satisfying:

1. $\|A\| \ge 0$.

2. $\|A\| = 0$ if and only if A = 0.

3. $\|\lambda A\| = |\lambda| \, \|A\|$.

4. $\|A + B\| \le \|A\| + \|B\|$.

5. $\|AB\| \le \|A\| \, \|B\|$.


REMARK 3.14 In contrast to vector norms, we have an additional fifth 
property, referred to as a submultiplicative property, dealing with the norm 
of the product of two matrices. 


The following will be used in our analysis of iterative methods for solving 
systems of equations, and elsewhere. 


PROPOSITION 3.3
If $\| \cdot \|$ is any matrix norm as in Definition 3.24, then
$\rho(A) \le \|A\|$.


PROOF Let

\[ B = (x, x, x, \ldots, x), \]

that is, let B be the n by n matrix, each of whose columns is the nonzero
vector x, where x is such that $Ax = \lambda x$. Then

\[ AB = (\lambda x, \ldots, \lambda x) = \lambda B . \]

Thus, by the third and fifth properties of matrix norms, $|\lambda| \, \|B\| \le \|A\| \, \|B\|$,
so $|\lambda| \le \|A\|$ for any eigenvalue $\lambda$ of A. Since the spectral radius is the maximum
modulus of the eigenvalues of A, the result follows.


REMARK 3.15 By choosing a particular ordering of the elements, n × n
matrices may be viewed as vectors in $\mathbb{C}^{n^2}$, e.g., $d_\ell = a_{jk}$, $\ell = j + n(k-1)$ for
$j = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, n$. Thus, if we define a matrix norm using a
vector norm on the $n^2$-vector, the first four conditions in the above definition
are automatically satisfied. However, in general, condition (5) will not hold.
For example, consider $\|A\|_\infty = \max_{i,j} |a_{ij}|$, and take

\[ A = B = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} . \]

Then $\|AB\|_\infty \not\le \|A\|_\infty \|B\|_\infty$, since every entry of AB equals 2 while
$\|A\|_\infty \|B\|_\infty = 1$. One norm for which this procedure holds is
the Euclidean norm

\[ \|A\|_E = \left( \sum_{i,j=1}^{n} |a_{ij}|^2 \right)^{1/2} , \]

which is also called the Frobenius norm. One sees that (1), (2), (3), and (4)
of Definition 3.24 are satisfied by considering the corresponding $\ell_2$-norm $\| \cdot \|_2$
of the corresponding vector in $\mathbb{C}^{n^2}$. To prove (5), we have

\[ \|AB\|_E^2 = \sum_{i,j=1}^{n} \left| \sum_{k=1}^{n} a_{ik} b_{kj} \right|^2 \le \sum_{i,j=1}^{n} \left( \sum_{k=1}^{n} |a_{ik}|^2 \right) \left( \sum_{k=1}^{n} |b_{kj}|^2 \right) = \|A\|_E^2 \, \|B\|_E^2 . \]
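
The failure of the submultiplicative property for the entrywise max norm, and its validity for the Frobenius norm, can be checked directly. The following is a small sketch, assuming the 2 × 2 all-ones matrices as the counterexample; the helper names `matmul`, `max_entry`, and `frobenius` are illustrative.

```python
import math


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(A))
    ]


def max_entry(A):
    """Entrywise max norm: the vector max norm of the entries of A."""
    return max(abs(a) for row in A for a in row)


def frobenius(A):
    """Euclidean (Frobenius) norm of A."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))


A = [[1, 1], [1, 1]]
B = [[1, 1], [1, 1]]
AB = matmul(A, B)

print("max-entry: ", max_entry(AB), "vs", max_entry(A) * max_entry(B))
print("Frobenius: ", frobenius(AB), "vs", frobenius(A) * frobenius(B))
```

For this pair the Frobenius inequality holds with equality, while the entrywise max norm of the product exceeds the product of the norms.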


DEFINITION 3.25 A matrix norm $\|A\|$ and a vector norm $\|x\|$ are called
compatible if for all vectors x and matrices A we have $\|Ax\| \le \|A\| \, \|x\|$.


REMARK 3.16 A consequence of the Cauchy-Schwarz inequality is that
$\|Ax\|_2 \le \|A\|_E \|x\|_2$, i.e., the Euclidean norm $\| \cdot \|_E$ for matrices is compatible
with the $\ell_2$-norm $\| \cdot \|_2$ for vectors.


We now define some commonly used matrix norms.

DEFINITION 3.26 Given a vector norm $\| \cdot \|$, we define a natural or
induced matrix norm associated with it as

\[ \|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} . \qquad (3.1) \]


REMARK 3.17 An induced matrix norm is compatible with the given
vector norm, that is,

\[ \|A\| \, \|x\| \ge \|Ax\| \quad \text{for all } x \in \mathbb{C}^n . \qquad (3.2) \]


REMARK 3.18 It is straightforward to show that an induced matrix
norm satisfies the five properties required of a matrix norm, so it is indeed a
matrix norm. In particular,

\[ \|AB\| = \sup_{x \ne 0} \frac{\|ABx\|}{\|x\|} \le \sup_{x \ne 0} \frac{\|A\| \, \|Bx\|}{\|x\|} = \|A\| \, \|B\| , \]

since, with z = Bx, $\|ABx\| = \|Az\| \le \|A\| \, \|z\|$.

REMARK 3.19 Definition 3.26 is equivalent to

\[ \|A\| = \sup_{\|y\| = 1} \|Ay\| , \]

since

\[ \|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} = \sup_{x \ne 0} \left\| A \frac{x}{\|x\|} \right\| = \sup_{\|y\| = 1} \|Ay\| \]

(letting $y = x / \|x\|$).


REMARK 3.20 We shall use matrix norms $\| \cdot \|_\infty$, $\| \cdot \|_1$, and $\| \cdot \|_2$ induced
by the corresponding vector norms. That is,

\[ \|A\|_\infty = \sup_{x \ne 0} \frac{\|Ax\|_\infty}{\|x\|_\infty} , \quad \|A\|_1 = \sup_{x \ne 0} \frac{\|Ax\|_1}{\|x\|_1} , \quad \text{and} \quad \|A\|_2 = \sup_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} . \]


REMARK 3.21 The Frobenius norm $\| \cdot \|_E$ is not a natural norm, because
$\|I\| = 1$ for any natural norm but $\|I\|_E = \sqrt{n}$. Consider briefly the relation
between $\|A\|_2$ and $\|A\|_E$. First note that, in general, $\|A\|_2 \ne \|A\|_E$. (For
example, let A = I.) Also,

\[ \|A\|_2 = \sup_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2} \le \sup_{x \ne 0} \frac{\|A\|_E \|x\|_2}{\|x\|_2} = \|A\|_E . \]

Thus, $\|A\|_2 \le \|A\|_E$.


We now develop explicit expressions for $\|A\|_\infty$, $\|A\|_1$, and $\|A\|_2$.


PROPOSITION 3.4
(Formulas for the induced $\ell_1$ and $\ell_\infty$ matrix norms)

(a) $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$ = {maximum absolute row sum}.

(b) $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|$ = {maximum absolute column sum}.


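PROOF

\[ \|Ax\|_\infty = \max_{1 \le i \le n} \left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| |x_j| \le \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \, \|x\|_\infty . \]

Thus,

\[ \|A\|_\infty \le \max_{1 \le i \le n} \left( \sum_{j=1}^{n} |a_{ij}| \right) , \qquad (3.3) \]

since

\[ \frac{\|Ax\|_\infty}{\|x\|_\infty} \le \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| . \]

Now let k be the index of the row which has the maximum absolute row sum,
i.e.,

\[ \sum_{j=1}^{n} |a_{kj}| = \max_{1 \le i \le n} \left( \sum_{j=1}^{n} |a_{ij}| \right) , \]

and let y be defined by

\[ y_j = \begin{cases} \bar{a}_{kj} / |a_{kj}| & \text{if } a_{kj} \ne 0, \\ 0 & \text{if } a_{kj} = 0. \end{cases} \]

Hence, $\|y\|_\infty = 1$ (if $A \ne 0$). Also,

\[ \|Ay\|_\infty = \max_{1 \le i \le n} \left| \sum_{j=1}^{n} a_{ij} y_j \right| \ge \left| \sum_{j=1}^{n} a_{kj} y_j \right| = \sum_{j=1}^{n} |a_{kj}| = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| . \qquad (3.4) \]

Since, from the definition of $\|A\|_\infty$, $\|A\|_\infty \ge \|Ay\|_\infty$, (3.4) implies

\[ \|A\|_\infty \ge \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| . \qquad (3.5) \]

Combining (3.3) and (3.5) gives (a).
To prove (b), consider

\[ \|Ax\|_1 = \sum_{i=1}^{n} \left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \sum_{j=1}^{n} \sum_{i=1}^{n} |a_{ij}| |x_j| \le \max_{1 \le j \le n} \left( \sum_{i=1}^{n} |a_{ij}| \right) \|x\|_1 . \]

(Recall $\|x\|_1 = \sum_{j=1}^{n} |x_j|$.) Thus,

\[ \|A\|_1 \le \max_{1 \le j \le n} \left( \sum_{i=1}^{n} |a_{ij}| \right) . \qquad (3.6) \]

Let k be such that $\sum_{i=1}^{n} |a_{ik}| = \max_{1 \le j \le n} \left( \sum_{i=1}^{n} |a_{ij}| \right)$ and let $y = e_k$, where $(e_k)_j = \delta_{kj}$.
Then $\|y\|_1 = 1$, and

\[ \|Ay\|_1 = \sum_{i=1}^{n} \left| \sum_{j=1}^{n} a_{ij} y_j \right| = \sum_{i=1}^{n} \left| \sum_{j=1}^{n} a_{ij} \delta_{kj} \right| = \sum_{i=1}^{n} |a_{ik}| = \max_{1 \le j \le n} \left( \sum_{i=1}^{n} |a_{ij}| \right) . \]

However, from the definition of $\|A\|_1$, $\|A\|_1 \ge \|Ay\|_1$. Thus,

\[ \|A\|_1 \ge \max_{1 \le j \le n} \left( \sum_{i=1}^{n} |a_{ij}| \right) . \qquad (3.7) \]

Hence, (b) follows from (3.7) and (3.6).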
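
The formulas of Proposition 3.4, together with the extremal vectors used in the proof, can be checked on a small example. This is a sketch in plain Python; the function names and the 3 × 3 test matrix are assumptions chosen for illustration.

```python
def norm_inf_matrix(A):
    """Maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)


def norm_1_matrix(A):
    """Maximum absolute column sum."""
    return max(sum(abs(A[i][j]) for i in range(len(A))) for j in range(len(A[0])))


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


A = [[1, -2, 3], [-4, 5, -6], [7, 8, -9]]

# Extremal vector for the infinity norm: y_j = conj(a_kj)/|a_kj| for the row k
# with the largest absolute row sum, as in the proof.
k = max(range(len(A)), key=lambda i: sum(abs(a) for a in A[i]))
y_inf = [complex(a).conjugate() / abs(a) if a != 0 else 0 for a in A[k]]
attained_inf = max(abs(v) for v in matvec(A, y_inf))

# Extremal vector for the 1-norm: the unit vector e_k for the column k
# with the largest absolute column sum.
col = max(range(len(A[0])), key=lambda j: sum(abs(A[i][j]) for i in range(len(A))))
e_k = [1 if j == col else 0 for j in range(len(A[0]))]
attained_1 = sum(abs(v) for v in matvec(A, e_k))

print("||A||_inf =", norm_inf_matrix(A), " attained:", attained_inf)
print("||A||_1   =", norm_1_matrix(A), " attained:", attained_1)
```

Both suprema are attained by explicit unit vectors, which is what turns the upper bounds (3.3) and (3.6) into equalities.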


Now consider $\|A\|_2$. Recall that, in general, $\|A\|_2 \ne \|A\|_E$. Also, recall the following.

(i) $\rho(A) = \max_{1 \le i \le n} |\lambda_i(A)|$ = {spectral radius of matrix A} (the largest
magnitude of an eigenvalue of A).

(ii) A Hermitian matrix B ($B^H = B$) has real eigenvalues and a linearly
independent set of eigenvectors which are orthonormal with respect to
the inner product $(\cdot, \cdot)$, where $(x, y) = y^H x$.


With this, we have: 


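PROPOSITION 3.5
$\|A\|_2 = \sqrt{\rho(A^H A)}$.

PROOF Let $x \ne 0$ in $\mathbb{C}^n$. Then

\[ \frac{\|Ax\|_2^2}{\|x\|_2^2} = \frac{(Ax, Ax)}{(x, x)} = \frac{(Ax)^H Ax}{(x, x)} = \frac{x^H A^H A x}{(x, x)} = \frac{(A^H A x, x)}{(x, x)} . \]

Thus,

\[ \frac{\|Ax\|_2}{\|x\|_2} = \left( \frac{(A^H A x, x)}{(x, x)} \right)^{1/2} . \]

Now $A^H A$ is Hermitian. Let $v_i$, $i = 1, 2, \ldots, n$, be the orthonormal eigenvectors
of $A^H A$ and let

\[ x = \sum_{i=1}^{n} c_i v_i . \]

Since $(v_i, v_j) = \delta_{ij}$ and since

\[ A^H A x = \sum_{i=1}^{n} \lambda_i c_i v_i , \]

it follows that

\[ (x, x) = \sum_{i=1}^{n} |c_i|^2 \]

and

\[ (Ax, Ax) = (A^H A x, x) = \sum_{i=1}^{n} \lambda_i |c_i|^2 , \]

where $\lambda_i$, $1 \le i \le n$, are the eigenvalues of $A^H A$. Now $\lambda_i = (A v_i, A v_i) \ge 0$,
so

\[ \frac{\|Ax\|_2}{\|x\|_2} = \left( \frac{\sum_{i=1}^{n} \lambda_i |c_i|^2}{\sum_{i=1}^{n} |c_i|^2} \right)^{1/2} \le \left( \max_i \lambda_i \right)^{1/2} = \sqrt{\rho(A^H A)} . \qquad (3.8) \]

Thus, $\|A\|_2 \le \sqrt{\rho(A^H A)}$. Also, letting $x = v_k$ where $\lambda_k = \max_i \lambda_i = \rho(A^H A)$,
it follows that

\[ \|A\|_2 \ge \frac{\|A v_k\|_2}{\|v_k\|_2} = (A^H A v_k, v_k)^{1/2} = \lambda_k^{1/2} = \sqrt{\rho(A^H A)} . \qquad (3.9) \]

Combining the inequalities (3.8) and (3.9), we obtain $\|A\|_2 = \sqrt{\rho(A^H A)}$.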
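
For a real 2 × 2 matrix the formula of Proposition 3.5 can be evaluated in closed form, since the eigenvalues of the symmetric matrix A^T A solve a quadratic. The following sketch is restricted to that special case as an assumption made for brevity; the function name `spectral_norm_2x2` is illustrative.

```python
import math


def spectral_norm_2x2(A):
    """||A||_2 for a real 2x2 matrix, via the largest eigenvalue of A^T A."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix G = A^T A.
    g11 = a * a + c * c
    g12 = a * b + c * d
    g22 = b * b + d * d
    # Largest root of t^2 - (g11 + g22) t + (g11 g22 - g12^2) = 0.
    half_trace = 0.5 * (g11 + g22)
    radius = math.sqrt((0.5 * (g11 - g22)) ** 2 + g12 ** 2)
    return math.sqrt(half_trace + radius)


def frobenius(A):
    return math.sqrt(sum(v * v for row in A for v in row))


for M in ([[0.0, 2.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]):
    print(M, " ||A||_2 =", spectral_norm_2x2(M), " ||A||_E =", frobenius(M))
```

In each case the induced 2-norm does not exceed the Euclidean (Frobenius) norm, and for the identity the two differ, in line with Remark 3.21.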


It is interesting in view of Proposition 3.5 to revisit Proposition 3.3, namely,
that $\rho(A) \le \|A\|$ for any square matrix A and any matrix norm. An alternate
proof of this proposition for induced matrix norms can be based on quotients
$\|Ax\| / \|x\|$, where x is an arbitrary eigenvector of A. It is also interesting
to note that the difference between $\|A\|$ and $\rho(A)$ may be arbitrarily large.
Consider

\[ A = \begin{pmatrix} 0 & 2 \\ 0 & 0 \end{pmatrix} . \]

Then $\rho(A) = 0$ but $\|A\|_1 = 2$. (Replacing the entry 2 by any $\alpha > 0$ gives
$\rho(A) = 0$ and $\|A\|_1 = \alpha$.) However, the following result shows that there
is a matrix norm arbitrarily close to $\rho(A)$.


PROPOSITION 3.6
Let A be a given n × n matrix and let $\epsilon > 0$. Then there exists an induced
matrix norm $\| \cdot \|$ such that $\|A\| \le \rho(A) + \epsilon$.


PROOF The proof rests on the result in linear algebra that any matrix
is similar to an upper triangular matrix, i.e., given any matrix A, there exists
a nonsingular matrix P such that $PAP^{-1} = T = \Lambda + U$ where T is upper
triangular. Furthermore, the diagonal entries of T are the eigenvalues of A,
i.e., if $\Lambda = \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_n]$ then U has zeros on and below the diagonal.
(This result is known as Schur’s Theorem, and the decomposition $A = P^{-1} T P$
is known as the Schur decomposition.) With $\delta > 0$, define

\[ D^{-1} = \mathrm{diag}[1, \delta, \delta^2, \ldots, \delta^{n-1}] . \]

Then $C = D T D^{-1} = \Lambda + E$ where $E = (e_{ij}) = D U D^{-1}$ has elements

\[ e_{ij} = \begin{cases} 0 & \text{if } j \le i, \\ u_{ij} \delta^{j-i} & \text{if } j > i. \end{cases} \]

Hence, the elements of E can be made arbitrarily small by choosing $\delta$ small
enough. Also, note that $A = P^{-1} D^{-1} C D P$. Since DP is nonsingular, a
vector norm can be defined by

\[ \|x\| = \|DPx\|_2 = (x^H P^H D^H D P x)^{1/2} . \]

The matrix norm induced by this vector norm is

\[ \|A\| = \sup_{\|y\| = 1} \|Ay\| . \]

For the particular A above,

\[ \|Ay\| = \|DPAy\|_2 = \|CDPy\|_2 . \]

Letting z = DPy, we have $\|Ay\| = \|Cz\|_2 = (z^H C^H C z)^{1/2}$. But

\[ C^H C = (\Lambda^H + E^H)(\Lambda + E) = \Lambda^H \Lambda + M(\delta) , \]

where $M(\delta)$ is an n × n matrix whose elements are of order $\delta$, i.e., $|m_{ij} / \delta| \le k$
for $\delta$ sufficiently small. Thus,

\[ \|Ay\|^2 = z^H C^H C z \le \max_i |\lambda_i(A)|^2 \, z^H z + |z^H M(\delta) z| \le (\rho^2(A) + O(\delta)) \, z^H z . \]

Since z = DPy and $\|y\| = 1$, we have $\|y\| = \|DPy\|_2 = \|z\|_2 = 1$, that is,
$z^H z = 1$. Therefore,

\[ \|A\| \le [\rho^2(A) + O(\delta)]^{1/2} \le \rho(A) + (O(\delta))^{1/2} . \]

For $\delta$ sufficiently small, we can make $(O(\delta))^{1/2} < \epsilon$.
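
The scaling idea in the proof can be seen numerically. The sketch below is a simplified variant, not the construction of the proof itself: it assumes the matrix is already upper triangular (so P = I) and uses the vector norm ||x|| = ||Dx||_inf rather than ||DPx||_2, because the induced norm is then the maximum absolute row sum of D T D^{-1}. The function name `scaled_inf_norm` and the test matrix are illustrative.

```python
def scaled_inf_norm(T, delta):
    """Induced matrix norm of T for the vector norm ||x|| = ||D x||_inf,
    where D = diag(1, delta^-1, ..., delta^(1-n)).

    This equals ||D T D^-1||_inf, and (D T D^-1)_ij = t_ij * delta^(j - i).
    T is assumed to be upper triangular.
    """
    n = len(T)
    C = [[T[i][j] * delta ** (j - i) for j in range(n)] for i in range(n)]
    return max(sum(abs(c) for c in row) for row in C)


T = [[0.5, 2.0], [0.0, 0.25]]   # upper triangular, eigenvalues 0.5 and 0.25
rho = 0.5
for delta in (1.0, 0.1, 0.001):
    print("delta =", delta, " norm =", scaled_inf_norm(T, delta), " rho =", rho)
```

As delta decreases, the off-diagonal contribution shrinks and the norm of T in the scaled norm approaches the spectral radius from above, which is the content of the proposition.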


The following is now worth noting. 


PROPOSITION 3.7
All matrix norms are equivalent, that is, given any two matrix norms $\| \cdot \|_\alpha$
and $\| \cdot \|_\beta$ there exist positive constants $c_1$ and $c_2$ such that

\[ c_1 \|A\|_\alpha \le \|A\|_\beta \le c_2 \|A\|_\alpha . \]


PROOF By a particular ordering of the elements of A, A can be viewed as
a vector in $\mathbb{C}^{n^2}$. Thus, a matrix norm of A can be viewed as a vector norm
in $\mathbb{C}^{n^2}$. Since any two vector norms are equivalent, any two matrix norms are
equivalent.


Note: Propositions 3.3, 3.6, and 3.7 give $\rho(A) \le \|A\| \le c \rho(A) + \epsilon$ where $\| \cdot \|$
is any matrix norm, c is a positive constant that depends on the norm $\| \cdot \|$
and matrix A, and $\rho(A)$ is the spectral radius of A. However, given A and
$\epsilon > 0$, there is a norm $\| \cdot \|'$ such that $\rho(A) \le \|A\|' \le \rho(A) + \epsilon$.


We now consider sequences of matrices. 


DEFINITION 3.27 $\{A_k\}_{k=1}^{\infty}$ converges to matrix A if and only if
$\|A_k - A\| \to 0$ as $k \to \infty$ for some matrix norm.


Note: The choice of norm is not important from a theoretical point of view,
since all matrix norms are equivalent. Thus, one should use the norm most
suitable for the particular problem.

Note: It is clear that if $\|A_k - A\| \to 0$ as $k \to \infty$, then $a_{ij}^{(k)} \to a_{ij}$ as $k \to \infty$
for each i, j.


We now consider the sequence $\{A^k\}_{k=1}^{\infty}$, i.e., elements of the sequence are
powers of A. We have the following convergence result:


THEOREM 3.5
Given an n × n matrix A, the following are equivalent.

(a) $\lim_{k \to \infty} A^k = 0$ (i.e., $\lim_{k \to \infty} \|A^k\| = 0$).

(b) $\lim_{k \to \infty} A^k x = 0$, $\forall x \in \mathbb{C}^n$ (i.e., $\lim_{k \to \infty} \|A^k x\| = 0$).

(c) $\rho(A) < 1$.

(d) There exists a norm $\| \cdot \|$ such that $\|A\| < 1$.


REMARK 3.22 A matrix for which (a) holds is called a convergent
matrix.


PROOF We will prove that (a) ⇒ (b) ⇒ (c) ⇒ (d) ⇒ (a).

(a) ⇒ (b): We have $\|A^k x\| \le \|A^k\| \, \|x\|$. Thus, $\lim_{k \to \infty} \|A^k x\| = 0$.

(b) ⇒ (c): Let v be an eigenvector of A with associated eigenvalue $\lambda$. By
(b), $\lim_{k \to \infty} \|A^k v\| = \lim_{k \to \infty} |\lambda|^k \|v\| = 0$. This implies that $|\lambda| < 1$. Since
this holds for any eigenvalue of A, $\rho(A) < 1$.

(c) ⇒ (d): Given $\rho(A) < 1$, we choose $\epsilon$ such that $0 < \epsilon < 1 - \rho(A)$. Then, by
Proposition 3.6, there is a norm with $\|A\| \le \rho(A) + \epsilon < 1$.

(d) ⇒ (a): We have $\|A^k\| \le \|A\|^k$. Thus, $\lim_{k \to \infty} \|A^k\| \le \lim_{k \to \infty} \|A\|^k = 0$, since
$\|A\| < 1$.


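REMARK 3.23 In condition (d), the words “there exists” should be stressed.
Consider

\[ A = \begin{pmatrix} 0 & 2 \\ 0 & 0 \end{pmatrix} . \]

$\|A\|_\infty = \|A\|_1 = \|A\|_2 = \|A\|_E = 2 > 1$, but $\lambda_1(A) = \lambda_2(A) = 0$, so $\rho(A) = 0$.
Thus, $\lim_{k \to \infty} A^k = 0$. In fact, $A^k = 0$ for $k \ge 2$.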
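
The point of Remark 3.23, that powers of a matrix can tend to zero even when a familiar norm of the matrix exceeds 1, can be observed with a matrix that is not nilpotent. The following sketch uses an assumed example with both eigenvalues equal to 0.5 and infinity norm 2.5; the helper names are illustrative.

```python
def matmul(A, B):
    inner, cols = len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(A))
    ]


def inf_norm(A):
    return max(sum(abs(a) for a in row) for row in A)


A = [[0.5, 2.0], [0.0, 0.5]]        # rho(A) = 0.5, but ||A||_inf = 2.5
power = [[1.0, 0.0], [0.0, 1.0]]    # starts at the identity
history = []
for k in range(1, 61):
    power = matmul(power, A)
    history.append(inf_norm(power))  # ||A^k||_inf

print("||A||_inf    =", history[0])
print("||A^2||_inf  =", history[1])
print("||A^10||_inf =", history[9])
print("||A^60||_inf =", history[-1])
```

The norms of the powers stay above 1 for the first few k before the factor 0.5^k takes over, so convergence is governed by the spectral radius, not by the size of any one norm of A.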


Two more useful results are now given. 


PROPOSITION 3.8

The geometric series

\[ \sum_{k=0}^{\infty} A^k = I + A + A^2 + A^3 + \cdots \]

converges to a certain matrix if and only if $A^k \to 0$ as $k \to \infty$. Furthermore,
if $A^k \to 0$ as $k \to \infty$, then I − A is nonsingular and

\[ (I - A)^{-1} = \sum_{k=0}^{\infty} A^k . \]

PROOF Let $A^k \to 0$. By Theorem 3.5, $\rho(A) < 1$. Consider B = I − A.
Then $\lambda(B) = 1 - \lambda(A)$, where $\lambda(B)$ is any eigenvalue of B and $\lambda(A)$ is the
corresponding eigenvalue of A. Thus, $\lambda(B) \ne 0$ since $|\lambda(A)| < 1$. Hence, I − A
is nonsingular because no eigenvalues are zero. Now consider the identity

\[ (I - A)(I + A + A^2 + \cdots + A^k) = I - A^{k+1} . \]

This implies

\[ (I + A + \cdots + A^k) = (I - A)^{-1} (I - A^{k+1}) , \]

that is,

\[ \lim_{k \to \infty} \sum_{j=0}^{k} A^j = (I - A)^{-1} \lim_{k \to \infty} (I - A^{k+1}) = (I - A)^{-1} . \]

Thus,

\[ (I - A)^{-1} = \sum_{j=0}^{\infty} A^j . \]

Conversely, let $\sum_{k=0}^{\infty} A^k$ be a convergent series. Then

\[ A^k = \sum_{j=0}^{k} A^j - \sum_{j=0}^{k-1} A^j \to 0 \]

as $k \to \infty$. That is, a necessary condition for convergence of the series is
$\lim_{k \to \infty} A^k = 0$.
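
The geometric (Neumann) series of Proposition 3.8 can be summed numerically and compared with a directly computed inverse. This is a minimal sketch for a 2 × 2 example chosen here so that the infinity norm of A is 0.5; the function names are illustrative.

```python
def matmul(A, B):
    inner, cols = len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(A))
    ]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_partial_sum(A, terms):
    """Return I + A + A^2 + ... + A^terms."""
    n = len(A)
    total, power = identity(n), identity(n)
    for _ in range(terms):
        power = matmul(power, A)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[0.2, 0.3], [0.1, 0.4]]
I_minus_A = [[1.0 - 0.2, -0.3], [-0.1, 1.0 - 0.4]]
exact = inverse_2x2(I_minus_A)
approx = neumann_partial_sum(A, 60)
error = max(abs(approx[i][j] - exact[i][j]) for i in range(2) for j in range(2))
print("largest entrywise error after 60 terms:", error)
```

Because the norm of A is 0.5 here, each extra term roughly halves the error. When the spectral radius is close to 1 the series converges slowly, and a direct solve is usually preferable.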


PROPOSITION 3.9

Let $\| \cdot \|$ be a vector norm and $\|A\|$ its induced matrix norm, i.e.,

\[ \|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} . \]

If $\|A\| < 1$, then (I − A) is nonsingular and

\[ \frac{1}{1 + \|A\|} \le \|(I - A)^{-1}\| \le \frac{1}{1 - \|A\|} . \]


PROOF By Theorem 3.5 and Proposition 3.8, (I − A) is nonsingular. We
also have $\|I\| = 1$, since $\| \cdot \|$ is an induced matrix norm. Thus,

\[ 1 = \|I\| = \|(I - A)^{-1} (I - A)\| \le \|(I - A)^{-1}\| \, \|I - A\| \le \|(I - A)^{-1}\| \, (1 + \|A\|) . \]

Hence, $1 / (1 + \|A\|) \le \|(I - A)^{-1}\|$.
Also, $(I - A)^{-1} = I + A (I - A)^{-1}$. Thus,

\[ \|(I - A)^{-1}\| \le 1 + \|A\| \, \|(I - A)^{-1}\| , \]

that is,

\[ (1 - \|A\|) \, \|(I - A)^{-1}\| \le 1 . \]

Hence, $\|(I - A)^{-1}\| \le 1 / (1 - \|A\|)$.
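
The two-sided bound of Proposition 3.9 can be checked on a small example using the induced infinity norm, which Proposition 3.4 makes easy to compute. The matrix below and the helper names are assumptions chosen for illustration.

```python
def inf_norm(M):
    """Induced infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[0.2, -0.3], [0.1, 0.4]]                      # ||A||_inf = 0.5 < 1
I_minus_A = [[1.0 - A[0][0], -A[0][1]], [-A[1][0], 1.0 - A[1][1]]]

norm_A = inf_norm(A)
norm_inv = inf_norm(inverse_2x2(I_minus_A))
lower = 1.0 / (1.0 + norm_A)
upper = 1.0 / (1.0 - norm_A)
print(lower, "<=", norm_inv, "<=", upper)
```

The upper bound degrades as the norm of A approaches 1, which reflects the fact that I − A may then be close to singular.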
Note: This concludes, for now, the review of linear algebra. More results, 


such as for eigenvalues and eigenvectors and for irreducible matrices, will be 
given later in this chapter or in Chapter 5. 


3.3. Direct Methods for Solving Linear Systems 


We discuss elimination and factorization methods for solving linear systems 
in this section. We begin with Gaussian elimination. 


3.3.1 Gaussian Elimination 


We first present our notation. 


3.3.1.1 Statement of the Problem 


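Given a nonsingular complex $n \times n$ matrix $A$ and $b \in \mathbb{C}^n$, find $x \in \mathbb{C}^n$
such that $Ax = b$. That is, if $A = (a_{ij})$ and $b = (b_1, \cdots, b_n)^T$, find
$x = (x_1, x_2, \cdots, x_n)^T$ such that

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1, \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2, \\
&\;\;\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n.
\end{aligned}$$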


For a general matrix $A$, the goal of Gaussian elimination is to reduce $A$ to an
upper triangular matrix through a sequence of elementary row operations.⁵
We will see that this is equivalent to finding an invertible matrix $M$ such that
$MA = U$, where $U$ is upper triangular, so $MAx = Mb$, that is, $Ux = Mb$.

⁵ Recall that row operations involve interchanging two rows and replacement of a row by
the sum of that row and a scalar multiple of another row; see Definition 3.12 on page 88.
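Suppose that

$$U = \begin{pmatrix}
a_{11}^{(n)} & a_{12}^{(n)} & \cdots & a_{1n}^{(n)} \\
0 & a_{22}^{(n)} & \cdots & a_{2n}^{(n)} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & a_{nn}^{(n)}
\end{pmatrix}
\quad\text{and}\quad
Mb = \begin{pmatrix} b_1^{(n)} \\ b_2^{(n)} \\ \vdots \\ b_n^{(n)} \end{pmatrix}.$$

(The use of the index $(n)$ will become clear shortly.) Then $Ux = Mb$ can be
rapidly solved by a process called backsolving or back substitution. Specifically,

$$x_n = b_n^{(n)} / a_{nn}^{(n)},$$

$$x_k = \Big( b_k^{(n)} - \sum_{j=k+1}^{n} a_{kj}^{(n)} x_j \Big) \Big/ a_{kk}^{(n)} \quad \text{for } k = n-1, n-2, \cdots, 1.$$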


3.3.1.2 The Gaussian Elimination Procedure 


Recall that we are trying to solve $Ax = b$, where $A$ is nonsingular. For
uniformity of notation, let $A = A^{(1)} = (a_{ij}^{(1)})$ and $b = b^{(1)} = (b_1^{(1)}, b_2^{(1)}, \cdots, b_n^{(1)})^T$,
so that $A^{(1)}x = b^{(1)}$.


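Step 1: Assume that $a_{11}^{(1)} \neq 0$. (Otherwise, the nonsingularity of $A$ guarantees
that the rows of $A$ can be interchanged in such a way that the new
$a_{11}^{(1)}$ is nonzero.) Let

$$m_{i1} = \frac{a_{i1}^{(1)}}{a_{11}^{(1)}}, \quad 2 \le i \le n.$$

Now multiply the first equation of $A^{(1)}x = b^{(1)}$ by $m_{i1}$ and subtract the
result from the $i$-th equation. Repeat this for each $i$, $2 \le i \le n$. As a
result, we obtain $A^{(2)}x = b^{(2)}$, where

$$A^{(2)} = \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\
\vdots & \vdots & & \vdots \\
0 & a_{n2}^{(2)} & \cdots & a_{nn}^{(2)}
\end{pmatrix}
\quad\text{and}\quad
b^{(2)} = \begin{pmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_n^{(2)} \end{pmatrix}.$$

$A^{(2)}$ is invertible, since row operations do not affect $|\det(A)|$, i.e.,

$$|\det(A^{(2)})| = |\det(A^{(1)})|.$$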


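Step 2: We consider the $(n-1) \times (n-1)$ submatrix $\tilde{A}^{(2)}$ of $A^{(2)}$ defined
by $\tilde{A}^{(2)} = (a_{ij}^{(2)})$, $2 \le i, j \le n$. We eliminate the first column of $\tilde{A}^{(2)}$
in a manner identical to the procedure for $A^{(1)}$. The result is the system
$A^{(3)}x = b^{(3)}$, where $A^{(3)}$ has the form

$$A^{(3)} = \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\
0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & a_{n3}^{(3)} & \cdots & a_{nn}^{(3)}
\end{pmatrix}.$$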

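Steps 3 to n-1: The process continues as above, where at the $k$-th stage
we have $A^{(k)}x = b^{(k)}$, $1 \le k \le n-1$, where

$$A^{(k)} = \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & \cdots & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & \cdots & \cdots & \cdots & a_{2n}^{(2)} \\
\vdots & \ddots & \ddots & & & \vdots \\
0 & \cdots & 0 & a_{kk}^{(k)} & \cdots & a_{kn}^{(k)} \\
\vdots & & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & a_{nk}^{(k)} & \cdots & a_{nn}^{(k)}
\end{pmatrix}
\quad\text{and}\quad
b^{(k)} = \begin{pmatrix} b_1^{(1)} \\ b_2^{(2)} \\ \vdots \\ b_k^{(k)} \\ \vdots \\ b_n^{(k)} \end{pmatrix}.
\tag{3.10}$$

For every $i$, $k+1 \le i \le n$, the $k$-th equation is multiplied by

$$m_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}$$

and subtracted from the $i$-th equation. (We assume that, if necessary, a row
is interchanged so that $a_{kk}^{(k)} \neq 0$.) After step $k = n-1$, the resulting
system is $A^{(n)}x = b^{(n)}$, where $A^{(n)}$ is upper triangular.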


On a computer, this algorithm can be programmed as:

ALGORITHM 3.1
(Gaussian elimination, forward phase)

INPUT: $A \in L(\mathbb{R}^n)$ and $b \in \mathbb{R}^n$.
OUTPUT: $A^{(n)}$ and $b^{(n)}$.
FOR $k = 1, 2, \cdots, n-1$
    FOR $i = k+1, \cdots, n$
        (a) $m_{ik} \leftarrow a_{ik}/a_{kk}$.
        (b) FOR $j = k, k+1, \cdots, n$
                $a_{ij} \leftarrow a_{ij} - m_{ik}a_{kj}$.
            END FOR
        (c) $b_i \leftarrow b_i - m_{ik}b_k$.
    END FOR
END FOR
END ALGORITHM 3.1.

Note: In Step (b) of Algorithm 3.1, we need only do the loop for $j = k+1$
to $n$, since we know that the resulting $a_{ik}$ will equal 0.


Back solving can be programmed as:

ALGORITHM 3.2
(Gaussian elimination, back solving phase)

INPUT: $A^{(n)}$ and $b^{(n)}$ from Algorithm 3.1.
OUTPUT: $x \in \mathbb{R}^n$ as a solution to $Ax = b$.

1. $x_n \leftarrow b_n/a_{nn}$.
2. FOR $k = n-1, n-2, \cdots, 1$
       $x_k \leftarrow \big(b_k - \sum_{j=k+1}^{n} a_{kj}x_j\big)/a_{kk}$.
   END FOR
END ALGORITHM 3.2.

Note: To solve $Ax = b$ using Gaussian elimination requires $\frac{1}{3}n^3 + O(n^2)$
multiplications and divisions. (See Exercise 18 on page 182.)
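As an illustration, Algorithms 3.1 and 3.2 might be transcribed into Python roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, matrices are plain lists of rows with 0-based indices, and, as in Algorithm 3.1, no row interchanges are performed, so every pivot is assumed to be nonzero.

```python
def forward_eliminate(A, b):
    """Algorithm 3.1: reduce A x = b to an upper triangular system.

    Works on copies; assumes every pivot a_kk is nonzero (no pivoting).
    """
    n = len(A)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]            # multiplier m_ik
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    return A, b

def back_substitute(U, c):
    """Algorithm 3.2: solve U x = c for upper triangular U."""
    n = len(U)
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(U[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (c[k] - s) / U[k][k]
    return x

# Small example whose exact solution is (1, 2, 3).
A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [7.0, 19.0, 49.0]
U, c = forward_eliminate(A, b)
x = back_substitute(U, c)
```

For this example the multipliers are $m_{21} = 2$, $m_{31} = 4$, and $m_{32} = 3$, and the reduced system has diagonal entries 2, 1, 2.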
3.3.1.3 The LU decomposition 


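Now let's consider the LU decomposition (triangular decomposition). Assume
first that no row interchanges are performed in Gaussian elimination.
Let

$$M^{(1)} = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
-m_{21} & & & \\
\vdots & & I_{n-1} & \\
-m_{n1} & & &
\end{pmatrix},$$

where $I_{n-1}$ is the $(n-1) \times (n-1)$ identity matrix, and where $m_{i1}$, $2 \le i \le n$,
are defined in Gaussian elimination. Then it is straightforward to see that
$A^{(2)} = M^{(1)}A^{(1)}$ and $b^{(2)} = M^{(1)}b^{(1)}$. Also, $\det(M^{(1)}) = 1$, so $\det(A^{(2)}) =
\det(A^{(1)})$. At the $r$-th stage of the Gaussian elimination process,

$$M^{(r)} = \begin{pmatrix}
I_{r-1} & 0 & 0 \\
0 & 1 & 0 \\
0 & \begin{matrix} -m_{r+1,r} \\ \vdots \\ -m_{n,r} \end{matrix} & I_{n-r}
\end{pmatrix}, \tag{3.11}$$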


(3.12) 


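where $m_{ir}$, $r+1 \le i \le n$, are given in the Gaussian elimination process
(Exercise 20 on page 182). Also, it is easy to see that $A^{(r+1)} = M^{(r)}A^{(r)}$
and $b^{(r+1)} = M^{(r)}b^{(r)}$. (Note: We are assuming here that $a_{rr}^{(r)} \neq 0$ and
no row interchanges are required.) Collecting the above results, we obtain
$A^{(n)}x = b^{(n)}$, where

$$A^{(n)} = M^{(n-1)}M^{(n-2)} \cdots M^{(1)}A^{(1)} \quad\text{and}\quad b^{(n)} = M^{(n-1)}M^{(n-2)} \cdots M^{(1)}b^{(1)}.$$

Recalling that $A^{(n)}$ is upper triangular and setting $A^{(n)} = U$, we have

$$A = (M^{(n-1)}M^{(n-2)} \cdots M^{(1)})^{-1}U. \tag{3.13}$$

Note: The product of two lower triangular matrices is lower triangular and
the inverse of a nonsingular lower triangular matrix is lower triangular (Exercise
19 on page 182). Thus, $L = (M^{(1)})^{-1}(M^{(2)})^{-1} \cdots (M^{(n-1)})^{-1}$ is lower
triangular. Hence, $A = LU$, i.e., $A$ is expressed as a product of lower and
upper triangular matrices. The result is called the LU decomposition (also
known as the LU factorization, triangular factorization, or triangular
decomposition) of $A$. The final matrices $L$ and $U$ are given by:

$$L = \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
m_{21} & 1 & 0 & \cdots & 0 \\
m_{31} & m_{32} & 1 & \cdots & 0 \\
\vdots & \vdots & & \ddots & \vdots \\
m_{n1} & m_{n2} & \cdots & m_{n,n-1} & 1
\end{pmatrix}
\quad\text{and}\quad
U = \begin{pmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\
0 & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & a_{nn}^{(n)}
\end{pmatrix}.
\tag{3.14}$$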
(See Exercise 22 on page 182.) This decomposition can be so formed when
no row interchanges are required.⁶ Thus, the original problem $Ax = b$ is
transformed into $LUx = b$. This can be readily solved by forward substitution
followed by backward substitution. That is, we first solve $Ly = b$ for $y$, then
solve $Ux = y$ for $x$. This is a relatively fast procedure for large $n$ ($O(n^2)$
as opposed to $O(n^3)$ for initially factoring $A$) and is especially useful when
solutions are required for many different $b$'s for the same matrix $A$.
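The factor-once, solve-many pattern described above can be sketched as follows. This is an illustrative sketch, not the book's code: the names `lu_factor` and `lu_solve` are hypothetical, no row interchanges are performed (so nonzero pivots are assumed), and the multipliers are stored below the diagonal of the same array that holds $U$.

```python
def lu_factor(A):
    """Return a compact LU factorization of A (no pivoting).

    The strictly lower part of the result holds the multipliers m_ik
    (the off-diagonal entries of L); the upper part holds U.
    """
    n = len(A)
    LU = [row[:] for row in A]
    for k in range(n - 1):
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]                 # m_ik overwrites the eliminated entry
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

def lu_solve(LU, b):
    """Solve L y = b (unit lower triangular), then U x = y."""
    n = len(LU)
    y = b[:]
    for i in range(n):
        y[i] -= sum(LU[i][j] * y[j] for j in range(i))
    x = y[:]
    for i in range(n - 1, -1, -1):
        x[i] = (x[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x

A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
LU = lu_factor(A)                     # O(n^3) work, done once
x1 = lu_solve(LU, [7.0, 19.0, 49.0])  # O(n^2) work per right-hand side
x2 = lu_solve(LU, [2.0, 4.0, 8.0])
det_A = 1.0
for i in range(3):
    det_A *= LU[i][i]                 # product of the diagonal entries of U
```

Each additional right-hand side costs only the two triangular solves. The product of the diagonal entries of $U$ gives $\det(A)$ when no rows are interchanged.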


REMARK 3.24 If $A = LU$, then

$$\det(A) = \det(L)\det(U) = \prod_{j=1}^{n} a_{jj}^{(j)}.$$

Hence, Gaussian elimination is an efficient procedure for computing $\det(A)$.
(Using expansion by minors to compute the determinant requires $O(n!)$
multiplications.)

REMARK 3.25 The inverse $A^{-1}$ can be rapidly found by solving $n$
systems $Ax_j = e_j$, where $(e_j)_i = \delta_{ij}$. If $A = LU$, we perform $n$ pairs of
forward and backward solves. Then, $A^{-1} = (x_1, x_2, \cdots, x_n)$. (Note that
$AA^{-1} = (Ax_1, Ax_2, \cdots, Ax_n) = I$.)


Note: In some software, the multiplying factors, that is, the nonzero off- 
diagonal elements of L, are stored in the locations of corresponding entries 
of A that are made equal to zero, thus obviating the need for extra storage. 
Effectively, such software returns the elements of L and U in the same array 
that was used to store A. 


3.3.1.4 Pivoting in Gaussian Elimination 
We have two questions:
1. When can Gaussian elimination be performed without row interchanges?
2. If row interchanges are employed, can Gaussian elimination always be
   carried out?


THEOREM 3.6
(Existence of an LU factorization) Assume that the $n \times n$ matrix $A$ is nonsingular.
Then $A = LU$ if and only if all the leading principal submatrices of $A$ are
nonsingular.⁷ Moreover, the LU decomposition is unique, where $L$ has unit
diagonal elements.

⁶ When row interchanges are required, we need to insert a permutation matrix into the
matrix product. We will define permutation matrix and discuss this later.

⁷ Recall that the leading principal submatrices of $A$ have the form

$$\begin{pmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & & \vdots \\ a_{k1} & \cdots & a_{kk} \end{pmatrix} \quad \text{for } k = 1, 2, \cdots, n.$$

REMARK 3.26 Two important types of matrices that have nonsingular
leading principal submatrices are symmetric positive definite and strictly
diagonally dominant, i.e.,

$$|a_{ii}| > \sum_{\substack{j=1 \\ j \neq i}}^{n} |a_{ij}|, \quad \text{for } i = 1, 2, \cdots, n.$$

PROOF (of Theorem 3.6) First, suppose that all leading principal
submatrices of $A$ are nonsingular. If we can show that $a_{kk}^{(k)} \neq 0$ in the Gaussian
elimination procedure for each $k$, $1 \le k \le n$, then we have proved that
$A = LU$. We prove this by induction. For $k = 1$, this is just $a_{11} \neq 0$. Now
suppose that $a_{ii}^{(i)} \neq 0$ for $i = 1, 2, \cdots, k-1$, so we have $A^{(1)}, A^{(2)}, \cdots, A^{(k)}$
and $M^{(1)}, M^{(2)}, \cdots, M^{(k-1)}$. We can write

$$A^{(k)} = M^{(k-1)}M^{(k-2)} \cdots M^{(1)}A \tag{3.15}$$

as

$$\begin{pmatrix} A_{11}^{(k)} & A_{12}^{(k)} \\ A_{21}^{(k)} & A_{22}^{(k)} \end{pmatrix}
= \begin{pmatrix} M_{11}^{(k-1)} & 0 \\ M_{21}^{(k-1)} & M_{22}^{(k-1)} \end{pmatrix}
\cdots
\begin{pmatrix} M_{11}^{(1)} & 0 \\ M_{21}^{(1)} & M_{22}^{(1)} \end{pmatrix}
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$

where $A_{11}$ is $k \times k$, $A_{12}$ is $k \times (n-k)$, $A_{21}$ is $(n-k) \times k$, and $A_{22}$ is $(n-k) \times (n-k)$.
In the above, $A_{11}^{(k)}$ is the leading principal submatrix of order $k$ of $A^{(k)}$, and the matrices
$M^{(k-1)}, M^{(k-2)}, \cdots, M^{(1)}$ and $A$ are partitioned accordingly. Since the $M^{(i)}$ are
lower triangular, it follows from (3.15) that $A_{11}^{(k)} = M_{11}^{(k-1)} \cdots M_{11}^{(1)}A_{11}$. But
all $M_{11}^{(i)}$ are nonsingular, and $A_{11}$ is nonsingular by assumption. Hence, $A_{11}^{(k)}$
is nonsingular, so

$$\det(A_{11}^{(k)}) = \det\begin{pmatrix} a_{11}^{(1)} & \cdots & a_{1k}^{(1)} \\ & \ddots & \vdots \\ 0 & & a_{kk}^{(k)} \end{pmatrix} = a_{11}^{(1)}a_{22}^{(2)} \cdots a_{kk}^{(k)} \neq 0,$$

so $a_{kk}^{(k)} \neq 0$.

For the converse, suppose that $A = LU$. We need to show that all leading
principal submatrices are nonsingular. We write

$$\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix}
= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$

where $A_{11}$ is $k \times k$ and $L$ and $U$ are partitioned accordingly. Since $A$ is
nonsingular, $L$ and $U$ are nonsingular and hence $L_{11}$ and $U_{11}$ are nonsingular.
But then $A_{11} = L_{11}U_{11}$ is nonsingular.

Finally, consider uniqueness. Assume that $L_1U_1 = L_2U_2 = A$. Then
$U_1U_2^{-1} = L_1^{-1}L_2$. But $L_1^{-1}L_2$ is lower triangular with diagonal elements unity,
and $U_1U_2^{-1}$ is upper triangular. Thus, $U_1U_2^{-1} = L_1^{-1}L_2 = I$, so $L_1 = L_2$ and
$U_1 = U_2$.


We now consider our second question, "If row interchanges are employed, can
Gaussian elimination be performed for any nonsingular A?" The following
definition will be useful.


DEFINITION 3.28 A permutation matrix $P$ is a matrix whose columns
consist of the $n$ different vectors $e_j$, $1 \le j \le n$, in any order.

For example,

$$P = (e_1, e_3, e_4, e_2) = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{pmatrix}.$$

Since a permutation matrix is a matrix whose columns are a rearrangement
of the columns of the identity matrix, the effect of multiplying by a permutation
matrix $P$ on the left is to rearrange the rows of $A$ in the order in which
the $e_j^T$ appear in the rows of $P$. For example,

$$\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
= \begin{pmatrix} 4 & 5 & 6 \\ 7 & 8 & 9 \\ 1 & 2 & 3 \end{pmatrix}.$$

Thus, by proper choice of $P$, any two or more rows can be interchanged.
Note: $\det P = \pm 1$, since $P$ is obtained from $I$ by row interchanges.

Now, Gaussian elimination with row interchanges can be performed by the
following matrix operations:⁸

$$A^{(n)} = M^{(n-1)}P^{(n-1)}M^{(n-2)}P^{(n-2)} \cdots M^{(2)}P^{(2)}M^{(1)}P^{(1)}A,$$
$$b^{(n)} = M^{(n-1)}P^{(n-1)} \cdots M^{(2)}P^{(2)}M^{(1)}P^{(1)}b.$$

It follows that $U = \hat{L}A^{(1)}$, where $\hat{L}$ is no longer lower triangular. However, if
we perform all the row interchanges first, at once, then

$$M^{(n-1)} \cdots M^{(1)}PAx = M^{(n-1)}M^{(n-2)} \cdots M^{(1)}Pb,$$

or

$$\tilde{L}PAx = \tilde{L}Pb,$$

so

$$\tilde{L}PA = U.$$

Thus,

$$PA = \tilde{L}^{-1}U = LU.$$

We thus have the following result:
We thus have the following result: 


THEOREM 3.7

If $A$ is a nonsingular $n \times n$ matrix, then there is a permutation matrix $P$
such that $PA = LU$, where $L$ is lower triangular and $U$ is upper triangular.
(Note: $\det(PA) = \pm\det(A) = \det(L)\det(U)$.)


We now examine the technique of pivoting in Gaussian elimination. Consider
the system

$$\begin{aligned}
0.0001x_1 + x_2 &= 1, \\
x_1 + x_2 &= 2.
\end{aligned}$$

The exact solution of this system is $x_1 \approx 1.00010$ and $x_2 \approx 0.99990$. Let
us solve the system using Gaussian elimination without row interchanges.
We will assume calculations are performed using three-digit rounding decimal
arithmetic. We obtain

$$m_{21} = \frac{a_{21}^{(1)}}{a_{11}^{(1)}} \approx 0.1 \times 10^{5},$$

$$a_{22}^{(2)} \leftarrow a_{22}^{(1)} - m_{21}a_{12}^{(1)} \approx 0.1 \times 10^{1} - 0.1 \times 10^{5} \approx -0.100 \times 10^{5}.$$

Also, $b^{(2)} \approx (0.1 \times 10^{1}, -0.1 \times 10^{5})^T$, so the computed (approximate) upper
triangular system is

$$\begin{aligned}
0.1 \times 10^{-3}x_1 + 0.1 \times 10^{1}x_2 &= 0.1 \times 10^{1}, \\
-0.1 \times 10^{5}x_2 &= -0.1 \times 10^{5},
\end{aligned}$$

whose solutions are $x_2 = 1$ and $x_1 = 0$. If instead, we first interchange the
equations so that $a_{11}^{(1)} = 1$, we find that $x_1 = x_2 = 1$, correct to the accuracy
used.

⁸ When implementing Gaussian elimination, we usually don't actually multiply full $n$ by
$n$ matrices together, since this is not efficient. However, viewing the process as matrix
multiplications has advantages when we analyze it.


Note: Small values of $a_{rr}^{(r)}$ in the $r$-th stage lead to large values of the $m_{ir}$'s
and may result in a loss of accuracy. Therefore, we want the pivots $a_{rr}^{(r)}$ to be
large.
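The two-equation example above can be replayed in simulated three-digit arithmetic. The sketch below is an illustration only: it uses Python's `decimal` module with a precision of three significant digits and round-half-up rounding as a stand-in for the three-digit rounding arithmetic of the text, and the function name `solve_2x2_three_digit` is hypothetical.

```python
from decimal import Decimal, localcontext, ROUND_HALF_UP

def solve_2x2_three_digit(A, b, swap_rows):
    """Eliminate and back-substitute a 2 x 2 system in 3-digit arithmetic.

    A and b hold decimal strings; swap_rows=True interchanges the two
    equations first, which is what partial pivoting does for this system.
    """
    with localcontext() as ctx:
        ctx.prec = 3                      # every operation rounds to 3 digits
        ctx.rounding = ROUND_HALF_UP
        A = [[Decimal(v) for v in row] for row in A]
        b = [Decimal(v) for v in b]
        if swap_rows:
            A[0], A[1] = A[1], A[0]
            b[0], b[1] = b[1], b[0]
        m21 = A[1][0] / A[0][0]           # multiplier
        a22 = A[1][1] - m21 * A[0][1]     # updated (2,2) entry
        b2 = b[1] - m21 * b[0]            # updated right-hand side
        x2 = b2 / a22
        x1 = (b[0] - A[0][1] * x2) / A[0][0]
        return x1, x2

A = [["0.0001", "1"], ["1", "1"]]
b = ["1", "2"]
without_pivoting = solve_2x2_three_digit(A, b, swap_rows=False)
with_pivoting = solve_2x2_three_digit(A, b, swap_rows=True)
```

Without the interchange the multiplier is $10^4$ and the computed solution is $(0, 1)$. With the interchange the multiplier is $10^{-4}$ and the computed solution is $(1, 1)$, in agreement with the text.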


Two common pivoting strategies are:

Partial pivoting: In partial pivoting, the $a_{ir}^{(r)}$, for $r \le i \le n$, in the $r$-th
column of $A^{(r)}$ are searched to find the element of largest absolute
value, and row interchanges are made to place that element in the pivot
position.

Full pivoting: In full pivoting, the pivot element is selected as the element
$a_{ij}^{(r)}$, $r \le i, j \le n$, of maximum absolute value among all elements of
the $(n-r+1) \times (n-r+1)$ submatrix of $A^{(r)}$. This strategy requires row and
column interchanges.

Note: In practice, partial pivoting in most cases is adequate.


Note: For some classes of matrices, no pivoting strategy is required for a 
stable elimination procedure. For example, no pivoting is required for a real 
symmetric positive definite matrix or for a strictly diagonally dominant ma- 
trix [99]. 


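We now present a formal algorithm for Gaussian elimination with partial
pivoting. In reading this algorithm, recall that

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1, \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2, \\
&\;\;\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n.
\end{aligned}$$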


ALGORITHM 3.3
(Solution of a linear system of equations with Gaussian elimination with
partial pivoting and back-substitution)
INPUT: $A \in L(\mathbb{R}^n)$ and $b \in \mathbb{R}^n$.
OUTPUT: An approximate solution⁹ $x$ to $Ax = b$.
FOR $k = 1, 2, \cdots, n-1$
    1. Find $\ell$ such that $|a_{\ell k}| = \max_{k \le j \le n} |a_{jk}|$ $(k \le \ell \le n)$.
    2. Interchange row $k$ with row $\ell$:
           $c_j \leftarrow a_{kj}$, $a_{kj} \leftarrow a_{\ell j}$, $a_{\ell j} \leftarrow c_j$ for $j = 1, 2, \ldots, n$, and
           $d \leftarrow b_k$, $b_k \leftarrow b_\ell$, $b_\ell \leftarrow d$.
    3. FOR $i = k+1, \cdots, n$
           (a) $m_{ik} \leftarrow a_{ik}/a_{kk}$.
           (b) FOR $j = k, k+1, \cdots, n$
                   $a_{ij} \leftarrow a_{ij} - m_{ik}a_{kj}$.
               END FOR
           (c) $b_i \leftarrow b_i - m_{ik}b_k$.
       END FOR
END FOR
4. Back-substitution:
    (a) $x_n \leftarrow b_n/a_{nn}$ and
    (b) $x_k \leftarrow \big(b_k - \sum_{j=k+1}^{n} a_{kj}x_j\big)/a_{kk}$, for $k = n-1, n-2, \cdots, 1$.
END ALGORITHM 3.3.
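A direct transcription of Algorithm 3.3 into Python might look as follows. This is a sketch, with assumptions: the function name `solve_partial_pivoting` is hypothetical, indices are 0-based, whole rows are swapped rather than tracked through an index vector, and the check for a zero pivot column is an addition not present in the algorithm.

```python
def solve_partial_pivoting(A, b):
    """Gaussian elimination with partial pivoting and back-substitution."""
    n = len(A)
    A = [row[:] for row in A]       # work on copies
    b = b[:]
    for k in range(n - 1):
        # Step 1: row index of the largest |a_jk| on or below the diagonal.
        l = max(range(k, n), key=lambda r: abs(A[r][k]))
        if A[l][k] == 0:
            raise ValueError("matrix is singular")
        # Step 2: interchange rows k and l of A and b.
        A[k], A[l] = A[l], A[k]
        b[k], b[l] = b[l], b[k]
        # Step 3: eliminate column k below the pivot.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Step 4: back-substitution.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(A[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (b[k] - s) / A[k][k]
    return x

# The leading entry is zero, so elimination without interchanges would fail.
x = solve_partial_pivoting([[0.0, 2.0, 1.0], [1.0, 1.0, 1.0], [2.0, 1.0, 3.0]],
                           [7.0, 6.0, 13.0])
# The small-pivot system from the pivoting discussion, in double precision.
z = solve_partial_pivoting([[1e-4, 1.0], [1.0, 1.0]], [1.0, 2.0])
```

The first system has exact solution $(1, 2, 3)$. The second reproduces $x_1 \approx 1.00010$ and $x_2 \approx 0.99990$.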


REMARK 3.27 In Algorithm 3.3, the computations are arranged “se- 
rially,” that is, they are arranged so each individual addition and multipli- 
cation is done separately. However, it is efficient on modern machines, that 
have “pipelined” operations and usually also have more than one processor, 
to think of the operations as being done on vectors. Furthermore, we don’t 
necessarily need to change entire rows, but just keep track of a set of indices 
indicating which rows are interchanged; for large systems, this saves a signif- 
icant number of storage and retrieval operations. For views of the Gaussian 
elimination process in terms of vector operations, see [34]. For an example of 
software that takes account of the way machines are built, see [4]. 


9 approximate because of roundoff error 
REMARK 3.28 One can show that the final system is $A^{(n)}x = b^{(n)}$,
where $A^{(n)}$ is upper triangular, $A^{(n)} = M^{(n-1)}P^{(n-1)} \cdots M^{(1)}P^{(1)}A = MA$,
and $b^{(n)} = M^{(n-1)}P^{(n-1)} \cdots M^{(1)}P^{(1)}b$, where the first $r-1$ columns of $M^{(r)}$
are the first $r-1$ columns of the identity matrix, the last $n-r$ columns of
$M^{(r)}$ are the last $n-r$ columns of the identity matrix, and the $r$-th column
of $M^{(r)}$ is

$$M^{(r)}_{:,r} = \big(0_{1 \times (r-1)},\; 1,\; -m_{r+1,r},\; \cdots,\; -m_{n,r}\big)^T,$$

where $0_{1 \times (r-1)}$ is $r-1$ zeros, and

$$P^{(r)} = (e_1, \cdots, e_{r-1}, e_j, e_{r+1}, \cdots, e_{j-1}, e_r, e_{j+1}, \cdots, e_n)$$

is an $n \times n$ permutation matrix where, at the $r$-th step, row $r$ is interchanged
with row $j$, and where $e_j$ denotes the $j$-th column of the identity matrix.

We have $A = M^{-1}A^{(n)} = M^{-1}U$, i.e., $A^{(n)} = U$ is upper triangular. However,
unless $P^{(r)} = I$ for every $r$, $M^{-1}$ is not lower triangular. (We have
proved that given any nonsingular matrix $A$, there exists a nonsingular matrix
$M$ such that $MA = U$.)


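REMARK 3.29 Note that

$$\begin{aligned}
\det(A) &= \det(M^{-1})\det(U) \\
&= \det(P^{(1)})^{-1}\det(M^{(1)})^{-1} \cdots \det(P^{(n-1)})^{-1}\det(M^{(n-1)})^{-1}\det(U).
\end{aligned}$$

But

$$\det(P^{(r)})^{-1} = \det(P^{(r)}) = \begin{cases} -1 & \text{if a row has been interchanged,} \\ \phantom{-}1 & \text{if no row has been interchanged,} \end{cases}$$

and $\det(M^{(r)})^{-1} = 1$. Thus,

$$\det(A) = (-1)^K\det(U) = (-1)^K a_{11}^{(1)}a_{22}^{(2)} \cdots a_{nn}^{(n)},$$

where $K$ is the number of row interchanges made.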


REMARK 3.30 The inverse $A^{-1}$ can be found by solving $n$ systems
$Ax_j = e_j$, $j = 1, 2, \cdots, n$, where $(e_j)_i = \delta_{ij}$, in a simultaneous manner. Then,

$$A^{-1} = (x_1, x_2, \cdots, x_n).$$


We now consider some special but commonly encountered kinds of matrices. 




3.3.2. Symmetric, Positive Definite Matrices 


We first characterize positive definite matrices. 


THEOREM 3.8

Let $A$ be a real symmetric $n \times n$ matrix. Then $A$ is positive definite if and
only if there exists an invertible lower triangular matrix $L$ such that $A = LL^T$.
Furthermore, we can choose the diagonal elements of $L$, $\ell_{ii}$, $1 \le i \le n$, to be
positive numbers.

Note: The decomposition with positive $\ell_{ii}$ is called the Cholesky factorization
of $A$. It can be shown that this decomposition is unique.


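PROOF (of Theorem 3.8) Recall that $A$ is positive definite if $x^TAx \ge 0$
for all $x \in \mathbb{R}^n$, with equality only if $x = 0$.

Part 1 (factorization exists implies positive definite) Let $A = LL^T$ with $L$
invertible. Then $x^TAx = x^TLL^Tx$. Let $y = L^Tx$. Then $x^TAx = y^Ty =
y_1^2 + y_2^2 + \cdots + y_n^2 \ge 0$, with equality only if $y = 0$. But since $L^T$ is invertible,
we have $y = 0$ only if $x = 0$. Therefore, $A$ is positive definite.

Part 2 (positive definite implies factorization exists) Let $A$ be symmetric
positive definite. It can be shown that the principal minor $\delta_i$ satisfies [69]:

$$\delta_i = \det\begin{pmatrix} a_{11} & \cdots & a_{1i} \\ \vdots & & \vdots \\ a_{i1} & \cdots & a_{ii} \end{pmatrix} > 0$$

for $i = 1, 2, \cdots, n$, for positive definite matrices. Thus, by Theorem 3.6, $A$
has the LU decomposition

$$A = \tilde{L}U = \begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
* & 1 & 0 & \cdots & 0 \\
* & * & 1 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
* & * & \cdots & * & 1
\end{pmatrix}
\begin{pmatrix}
u_{11} & * & * & \cdots & * \\
0 & u_{22} & * & \cdots & * \\
0 & 0 & u_{33} & \cdots & * \\
\vdots & & & \ddots & \vdots \\
0 & 0 & \cdots & 0 & u_{nn}
\end{pmatrix}.$$

That is, $\tilde{L}$ is lower triangular with unit diagonal entries and $U$ is upper
triangular. Since $\prod_{j=1}^{i} u_{jj} = \delta_i > 0$, $u_{ii} > 0$ for $1 \le i \le n$. Define the diagonal
matrix $\Lambda = \mathrm{diag}(\sqrt{u_{11}}, \sqrt{u_{22}}, \cdots, \sqrt{u_{nn}})$. Then $A = \tilde{L}U = \tilde{L}\Lambda\Lambda^{-1}U$. Now
define $L = \tilde{L}\Lambda$ and $V = \Lambda^{-1}U$. Then we have $A = LV$, with

$$L = \begin{pmatrix}
\sqrt{u_{11}} & 0 & \cdots & 0 \\
* & \sqrt{u_{22}} & \ddots & \vdots \\
\vdots & & \ddots & 0 \\
* & * & \cdots & \sqrt{u_{nn}}
\end{pmatrix}
\quad\text{and}\quad
V = \begin{pmatrix}
\sqrt{u_{11}} & * & \cdots & * \\
0 & \sqrt{u_{22}} & & \vdots \\
\vdots & \ddots & \ddots & * \\
0 & \cdots & 0 & \sqrt{u_{nn}}
\end{pmatrix}.$$

But $A = A^T$, so $LV = (LV)^T = V^TL^T$, and, since $L$ is invertible, we have
$V(L^T)^{-1} = L^{-1}V^T$. Since $(L^T)^{-1}$ is upper triangular and $V^T$ is lower
triangular (so $V$ is upper triangular), $V(L^T)^{-1}$ is upper triangular with unit elements along the diagonal.
Similarly, $L^{-1}V^T$ is lower triangular with unit elements along the diagonal.
Thus, $V(L^T)^{-1} = L^{-1}V^T = I$, so $V = L^T$, and thus $A = LL^T$.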


REMARK 3.31 $L$ can be computed using a variant of Gaussian elimination.
Set $\ell_{11} = \sqrt{a_{11}}$ and $\ell_{j1} = a_{j1}/\sqrt{a_{11}}$ for $2 \le j \le n$. (Note that $x^TAx > 0$,
and the choice $x = e_j$ implies that $a_{jj} > 0$.) Then, for $i = 1, 2, 3, \cdots, n$, set

$$\ell_{ii} = \Big[ a_{ii} - \sum_{k=1}^{i-1} \ell_{ik}^2 \Big]^{1/2},$$

$$\ell_{ji} = \frac{1}{\ell_{ii}} \Big[ a_{ji} - \sum_{k=1}^{i-1} \ell_{ik}\ell_{jk} \Big] \quad \text{for } i+1 \le j \le n.$$

Note: If $A$ is real symmetric and $L$ can be computed using the previous
note, then $A$ is positive definite. (This is an efficient way to show positive
definiteness.)

Note: To solve $Ax = b$ where $A$ is real symmetric positive definite, $L$ can be
formed using the first note and the pair $Ly = b$ and $L^Tx = y$ can be solved
for $x$.

Note: The multiplication and division count for Cholesky decomposition is
$n^3/6 + O(n^2)$.


Thus, for large n, about 1/2 the multiplications and divisions are required 
compared to standard Gaussian elimination. 
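The formulas of Remark 3.31 can be sketched in Python as follows. The function name `cholesky` and the error raised when a nonpositive quantity appears under the square root are assumptions of this sketch. The latter reflects the note that a successful computation shows positive definiteness.

```python
import math

def cholesky(A):
    """Return lower triangular L with positive diagonal such that A = L L^T.

    A is a real symmetric matrix stored as a list of rows; only its lower
    triangle is read. Raises ValueError if A is not positive definite.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d = A[i][i] - sum(L[i][k] ** 2 for k in range(i))
        if d <= 0:
            raise ValueError("matrix is not positive definite")
        L[i][i] = math.sqrt(d)
        for j in range(i + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(i))
            L[j][i] = (A[j][i] - s) / L[i][i]
    return L

A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
L = cholesky(A)
```

For this matrix the factor works out to rows $(2, 0, 0)$, $(6, 1, 0)$, and $(-8, 5, 3)$. A symmetric but indefinite matrix such as $\begin{pmatrix}1 & 2\\ 2 & 1\end{pmatrix}$ triggers the error at the second diagonal entry.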
3.3.3 Tridiagonal Matrices


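A tridiagonal matrix is a matrix of the form

$$A = \begin{pmatrix}
a_1 & c_1 & 0 & \cdots & \cdots & 0 \\
b_2 & a_2 & c_2 & 0 & \cdots & 0 \\
0 & b_3 & a_3 & c_3 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & b_{n-1} & a_{n-1} & c_{n-1} \\
0 & \cdots & \cdots & 0 & b_n & a_n
\end{pmatrix}.$$

Suppose that $A$ can be decomposed into a product of two bidiagonal matrices,
that is,

$$A = LU = \begin{pmatrix}
\alpha_1 & 0 & \cdots & 0 \\
b_2 & \alpha_2 & \ddots & \vdots \\
& \ddots & \ddots & 0 \\
0 & & b_n & \alpha_n
\end{pmatrix}
\begin{pmatrix}
1 & \gamma_1 & & 0 \\
0 & 1 & \ddots & \\
\vdots & \ddots & \ddots & \gamma_{n-1} \\
0 & \cdots & 0 & 1
\end{pmatrix}, \tag{3.16}$$

which gives

$$\begin{aligned}
\alpha_1 &= a_1, \\
\gamma_1 &= c_1/\alpha_1, \\
\alpha_i &= a_i - b_i\gamma_{i-1} \quad \text{for } i = 2, 3, \cdots, n, \\
\gamma_i &= c_i/\alpha_i \quad \text{for } i = 2, \cdots, n-1.
\end{aligned} \tag{3.17}$$

Therefore, if $\alpha_i \neq 0$, $1 \le i \le n$, we can compute the decomposition (3.16).
Furthermore, we can compute the solution to $Ax = f = (f_1, f_2, \cdots, f_n)^T$ by
successively solving $Ly = f$ and $Ux = y$, i.e.,

$$\begin{aligned}
y_1 &= f_1/\alpha_1, \\
y_i &= (f_i - b_iy_{i-1})/\alpha_i \quad \text{for } i = 2, 3, \cdots, n, \\
x_n &= y_n, \\
x_j &= y_j - \gamma_jx_{j+1} \quad \text{for } j = n-1, n-2, \cdots, 1.
\end{aligned} \tag{3.18}$$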


Sufficient conditions to guarantee decomposition (3.16) are given in the fol- 
lowing theorem. 


THEOREM 3.9

Suppose the elements $a_i$, $b_i$, and $c_i$ of $A$ satisfy $|a_1| > |c_1| > 0$, $|a_i| \ge |b_i| + |c_i|$,
and $b_ic_i \neq 0$ for $2 \le i \le n-1$, and suppose $|a_n| > |b_n| > 0$. Then $A$ is
invertible and the $\alpha_i$'s are nonzero. (Consequently, the factorization (3.16) is
possible.)

PROOF Since $\alpha_1 = a_1$ and $\gamma_1 = c_1/\alpha_1$, we have $\alpha_1 \neq 0$ and $|\gamma_1| < 1$. We
now proceed inductively. (Recall that $\alpha_i = a_i - b_i\gamma_{i-1}$, $\gamma_i = c_i/\alpha_i$.) Suppose
that, for some $j$ such that $2 \le j \le n$, we have $|\gamma_{j-1}| < 1$. Then

$$|a_j - b_j\gamma_{j-1}| \ge |a_j| - |b_j||\gamma_{j-1}| > |a_j| - |b_j| > 0,$$

since $|a_j| - |b_j| \ge |c_j| > 0$. Thus, $|\alpha_j| > 0$, whence $\alpha_j \neq 0$. Also,

$$|\gamma_j| = \frac{|c_j|}{|\alpha_j|} = \frac{|c_j|}{|a_j - b_j\gamma_{j-1}|} < \frac{|c_j|}{|a_j| - |b_j|} \le 1,$$

since $|a_j| - |b_j| \ge |c_j|$. Thus, $|\gamma_j| < 1$. The inductive proof is thus finished.
Hence, all $\alpha_j$'s are nonzero. The nonsingularity of $A$ follows from

$$\det(A) = \det(L)\det(U) = \prod_{j=1}^{n} \alpha_j \neq 0.$$

Note: It can be verified that solution of a linear system having tridiagonal
coefficient matrix using (3.17) and (3.18) requires $(5n - 4)$ multiplications
and divisions and $3(n - 1)$ additions and subtractions. (Recall that we need
$n^3/3 + O(n^2)$ multiplications and divisions for Gaussian elimination.) Storage
requirements are also drastically reduced to $3n$ locations, versus $n^2$ for a full
matrix.
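Formulas (3.17) and (3.18) translate into a short routine, sketched below under stated assumptions: the function name `solve_tridiagonal` is hypothetical, the three diagonals are passed as separate lists of length $n$ with 0-based indices, and the entries `b[0]` and `c[n-1]`, which have no counterpart in the matrix, are ignored.

```python
def solve_tridiagonal(a, b, c, f):
    """Solve a tridiagonal system using (3.17) and (3.18).

    a: main diagonal, b: subdiagonal (b[0] unused),
    c: superdiagonal (c[n-1] unused), f: right-hand side.
    Assumes all alpha_i are nonzero, e.g. under Theorem 3.9.
    """
    n = len(a)
    alpha = [0.0] * n
    gamma = [0.0] * n
    y = [0.0] * n
    alpha[0] = a[0]
    if n > 1:
        gamma[0] = c[0] / alpha[0]
    y[0] = f[0] / alpha[0]
    for i in range(1, n):
        alpha[i] = a[i] - b[i] * gamma[i - 1]       # (3.17)
        if i < n - 1:
            gamma[i] = c[i] / alpha[i]
        y[i] = (f[i] - b[i] * y[i - 1]) / alpha[i]  # forward solve L y = f
    x = y[:]
    for j in range(n - 2, -1, -1):
        x[j] = y[j] - gamma[j] * x[j + 1]           # back solve U x = y
    return x

# Second-difference matrix with diagonals (-1, 2, -1); exact solution (1, 2, 3, 4).
x = solve_tridiagonal([2.0, 2.0, 2.0, 2.0],
                      [0.0, -1.0, -1.0, -1.0],
                      [-1.0, -1.0, -1.0, 0.0],
                      [0.0, 0.0, 0.0, 5.0])
```

Only the three diagonals and a few work vectors of length $n$ are stored, in line with the storage remark above.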


3.3.4 Block Tridiagonal Matrices 


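We now consider briefly block tridiagonal matrices, that is, matrices of the
form

$$A = \begin{pmatrix}
A_1 & C_1 & 0 & \cdots & 0 & 0 \\
B_2 & A_2 & C_2 & 0 & \cdots & 0 \\
0 & B_3 & A_3 & C_3 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & B_{n-1} & A_{n-1} & C_{n-1} \\
0 & \cdots & 0 & 0 & B_n & A_n
\end{pmatrix},$$

where $A_i$, $B_i$, and $C_i$ are $m \times m$ matrices. Analogous to the tridiagonal case,
we construct a factorization of the form

$$A = \begin{pmatrix}
\hat{A}_1 & 0 & \cdots & 0 \\
B_2 & \hat{A}_2 & \ddots & \vdots \\
& \ddots & \ddots & 0 \\
0 & & B_n & \hat{A}_n
\end{pmatrix}
\begin{pmatrix}
I & E_1 & & 0 \\
0 & I & \ddots & \\
\vdots & \ddots & \ddots & E_{n-1} \\
0 & \cdots & 0 & I
\end{pmatrix}.$$

Provided the $\hat{A}_i$, $1 \le i \le n$, are nonsingular, we can compute:

$$\begin{aligned}
\hat{A}_1 &= A_1, \\
E_1 &= \hat{A}_1^{-1}C_1, \\
\hat{A}_i &= A_i - B_iE_{i-1} \quad \text{for } 2 \le i \le n, \\
E_i &= \hat{A}_i^{-1}C_i \quad \text{for } 2 \le i \le n-1.
\end{aligned}$$

For efficiency, the $\hat{A}_i^{-1}$ are generally not computed, but instead, the columns
of $E_i$ are computed by factoring $\hat{A}_i$ and solving a pair of triangular systems.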


PROPOSITION 3.10

If the inverses $\hat{A}_i^{-1}$ are computed, the total number of multiplications and
divisions to complete the factorization into a lower block triangular and an
upper block triangular matrix as just described is $3(n-1)m^3$ (depending on
the algorithm used to compute inverses), while, if the $E_i$'s are computed using
Gaussian elimination, the leading term $3nm^3$ is reduced to $\frac{7}{3}nm^3$.

PROOF The proof is left as Exercise 24 on page 182.


Since $A$ is an $nm \times nm$ matrix, standard Gaussian elimination would require
$\frac{1}{3}n^3m^3 + O(n^2m^2)$ multiplications and divisions. Clearly, tremendous savings
are achieved by taking advantage of the zero elements.

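Now consider

$$Ax = b, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix},$$

where $x_i, b_i \in \mathbb{R}^m$. Then, with the factorization $A = LU$, $Ax = b$ can be
solved as follows: $Ly = b$, $Ux = y$, with

$$\begin{aligned}
\hat{A}_1y_1 &= b_1, \\
\hat{A}_iy_i &= (b_i - B_iy_{i-1}) \quad \text{for } i = 2, \cdots, n, \\
x_n &= y_n, \\
x_j &= y_j - E_jx_{j+1} \quad \text{for } j = n-1, \cdots, 1.
\end{aligned}$$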


3.3.5 Roundoff Error and Conditioning in Gaussian Elimination


Round-off error analysis for Gaussian elimination can be divided into a 
description of condition numbers and ill-conditioned matrices, round-off error 
analysis, and iterative refinement. 
3.3.5.1 Condition Numbers


We begin with the following: 


DEFINITION 3.29 If the solution x of Ax = b changes drastically when 
A or b is perturbed slightly, then the system Ax = b is called ill-conditioned. 


Note: Because rounding errors are unavoidable with floating point arith- 
metic, much accuracy can be lost during Gaussian elimination for ill-condi- 
tioned systems. In fact, the final solution may be considerably different than 
the exact solution. 


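Example 3.7
An ill-conditioned system is

$$Ax = \begin{pmatrix} 1 & 0.99 \\ 0.99 & 0.98 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1.99 \\ 1.97 \end{pmatrix}, \quad \text{whose exact solution is } x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

However,

$$A\hat{x} = \begin{pmatrix} 1.989903 \\ 1.970106 \end{pmatrix} \quad \text{has solution } \hat{x} = \begin{pmatrix} 3 \\ -1.0203 \end{pmatrix}.$$

Thus, a change of

$$\delta b = \begin{pmatrix} -0.000097 \\ 0.000106 \end{pmatrix} \quad \text{produces a change } \delta x = \begin{pmatrix} 2.000 \\ -2.0203 \end{pmatrix}.$$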


We first study the phenomenon of ill-conditioning, then study roundoff error 
in Gaussian elimination. We begin with 


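THEOREM 3.10
Let $\|\cdot\|_\beta$ be an induced matrix norm. Let $x$ be the solution of $Ax = b$ with
$A$ an $n \times n$ invertible complex matrix. Let $x + \delta x$ be the solution of

$$(A + \delta A)(x + \delta x) = b + \delta b. \tag{3.19}$$

Assume that

$$\|\delta A\|_\beta\|A^{-1}\|_\beta < 1. \tag{3.20}$$

Then

$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \frac{\kappa_\beta(A)}{1 - \|A^{-1}\|_\beta\|\delta A\|_\beta}\left(\frac{\|\delta b\|_\beta}{\|b\|_\beta} + \frac{\|\delta A\|_\beta}{\|A\|_\beta}\right), \tag{3.21}$$

where

$$\kappa_\beta(A) = \|A\|_\beta\|A^{-1}\|_\beta$$

is defined to be the condition number of the matrix $A$ with respect to norm
$\|\cdot\|_\beta$.

PROOF By subtracting $Ax = b$ from (3.19), we obtain

$$(A + \delta A)\delta x = \delta b - \delta A\,x.$$

Then (3.20) and Proposition 3.9 (on page 104) imply the following.
(1) $(I + A^{-1}\delta A)$ is invertible.
(2) Since $A + \delta A = A(I + A^{-1}\delta A)$, we have $(A + \delta A)^{-1}$ exists.
Thus, by (1) and (2) above,

$$\delta x = (A + \delta A)^{-1}(\delta b - \delta A\,x) = (I + A^{-1}\delta A)^{-1}A^{-1}(\delta b - \delta A\,x).$$

Hence,

$$\begin{aligned}
\|\delta x\|_\beta &\le \|(I + A^{-1}\delta A)^{-1}\|_\beta\|A^{-1}\|_\beta\|\delta b - \delta A\,x\|_\beta \\
&\le \frac{1}{1 - \|A^{-1}\|_\beta\|\delta A\|_\beta}\|A^{-1}\|_\beta\big(\|\delta b\|_\beta + \|\delta A\|_\beta\|x\|_\beta\big),
\end{aligned}$$

where the last inequality follows from Proposition 3.9 on page 104. Hence,

$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \frac{\kappa_\beta(A)}{1 - \|A^{-1}\|_\beta\|\delta A\|_\beta}\left(\frac{\|\delta b\|_\beta}{\|A\|_\beta\|x\|_\beta} + \frac{\|\delta A\|_\beta}{\|A\|_\beta}\right),$$

and (3.21) follows from $\|b\|_\beta = \|Ax\|_\beta \le \|A\|_\beta\|x\|_\beta$.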


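Note: The condition number $\kappa_\beta(A) \ge 1$ for any induced matrix norm and
any matrix $A$, since

$$1 = \|I\|_\beta = \|A^{-1}A\|_\beta \le \|A^{-1}\|_\beta\|A\|_\beta = \kappa_\beta(A).$$

Note: If $\delta A = 0$, we have

$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \kappa_\beta(A)\frac{\|\delta b\|_\beta}{\|b\|_\beta},$$

and if $\delta b = 0$, then

$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \frac{\kappa_\beta(A)}{1 - \|A^{-1}\|_\beta\|\delta A\|_\beta}\,\frac{\|\delta A\|_\beta}{\|A\|_\beta}.$$

Note: There exist perturbations $\delta x$ and $\delta b$ for which (3.21) holds with equality.
That is, inequality (3.21) is sharp.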
REMARK 3.32 Consider $\|\cdot\|_\beta = \|\cdot\|_2$, where

$$\|A\|_2 = \sup_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}.$$

Let $\mu_1 \ge \mu_2 \ge \mu_3 \ge \cdots \ge \mu_n > 0$ be the eigenvalues of $A^HA$, with $A$
invertible. (Recall that, when $A$ is nonsingular, $A^HA$ is positive definite and
has $n$ positive eigenvalues.) Thus,

$$\|A\|_2 = \sqrt{\rho(A^HA)} = \sqrt{\mu_1},$$

$$\|A^{-1}\|_2 = \sqrt{\rho((A^{-1})^HA^{-1})} = \sqrt{\rho((A^HA)^{-1})} = \frac{1}{\sqrt{\mu_n}}.$$

(Notice that the eigenvalues of $C^{-1}$ are $1/\lambda_i$, $1 \le i \le n$, if the eigenvalues of
$C$ are $\lambda_i$, $1 \le i \le n$.) Thus,

$$\kappa_2(A) = \|A\|_2\|A^{-1}\|_2 = \sqrt{\mu_1/\mu_n}.$$


In the particular case that A is Hermitian, 


and 


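$$\kappa_2(A) = |\lambda_{\max}|/|\lambda_{\min}|,$$

where $|\lambda_{\max}| = \max_{1 \le i \le n} |\lambda_i(A)|$ and $|\lambda_{\min}| = \min_{1 \le i \le n} |\lambda_i(A)|$.

Note: For a unitary matrix $U$, i.e., $U^HU = I$, we have $\kappa_2(U) = 1$ (Exercise 26
on page 182). Such a matrix is called perfectly conditioned, since $\kappa_\beta(A) \ge 1$
for any $\beta$ and $A$.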


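Example 3.8
Consider the earlier example (Example 3.7) of an ill-conditioned matrix:

$$A = \begin{pmatrix} 1 & 0.99 \\ 0.99 & 0.98 \end{pmatrix}, \quad \text{with eigenvalues } \lambda_1 \approx 1.98005,\ \lambda_2 \approx -0.00005.$$

Since $A$ is Hermitian, $\kappa_2(A) = |\lambda_1|/|\lambda_2| \approx 39601$. Recall that

$$Ax = \begin{pmatrix} 1.99 \\ 1.97 \end{pmatrix}, \quad x_1 = 1,\ x_2 = 1,$$

and the perturbed system was

$$A\hat{x} = \begin{pmatrix} 1.989903 \\ 1.970106 \end{pmatrix}, \quad \hat{x}_1 = 3.000,\ \hat{x}_2 = -1.0203.$$

Thus, a change

$$\delta b = \begin{pmatrix} -0.000097 \\ 0.000106 \end{pmatrix} \quad \text{produced a change } \delta x = \begin{pmatrix} 2.0000 \\ -2.0203 \end{pmatrix}.$$

Computing $\|\delta x\|_2/\|x\|_2 \approx 2.010175$ and $\|\delta b\|_2/\|b\|_2 \approx 0.513123 \times 10^{-4}$, we
have $\|\delta x\|_2/\|x\|_2 \approx 39175\,\|\delta b\|_2/\|b\|_2$. (Recall that $\kappa_2(A) \approx 39601$.) Thus, in
this example, (3.21) is fairly sharp. (Note that $\delta A = 0$.) This example clearly
illustrates the phenomenon of ill-conditioning. (An uncertainty of $10^{-4}$ in
elements of $b$ results in an uncertainty of about 2 in elements of $x$.)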
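The numbers in Examples 3.7 and 3.8 can be checked with a few lines of Python. This is a sketch with hypothetical helper names. The systems are solved exactly by Cramer's rule in rational arithmetic, and the eigenvalues of the symmetric $2 \times 2$ matrix are obtained from its trace and determinant.

```python
from fractions import Fraction
import math

A = [[Fraction("1"), Fraction("0.99")], [Fraction("0.99"), Fraction("0.98")]]

def solve_2x2(M, rhs):
    """Exact solution of a 2 x 2 system by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    x1 = (rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det
    x2 = (M[0][0] * rhs[1] - M[1][0] * rhs[0]) / det
    return [x1, x2]

def norm2(v):
    """Euclidean norm, evaluated in floating point."""
    return math.sqrt(sum(float(t) ** 2 for t in v))

b = [Fraction("1.99"), Fraction("1.97")]
b_hat = [Fraction("1.989903"), Fraction("1.970106")]
x = solve_2x2(A, b)
x_hat = solve_2x2(A, b_hat)
dx = [p - q for p, q in zip(x_hat, x)]
db = [p - q for p, q in zip(b_hat, b)]

# Observed amplification of the relative change in b.
amplification = (norm2(dx) / norm2(x)) / (norm2(db) / norm2(b))

# kappa_2(A) = |lambda_1| / |lambda_2| for this symmetric matrix.
trace = float(A[0][0] + A[1][1])
det = float(A[0][0] * A[1][1] - A[0][1] * A[1][0])
lam1 = (trace + math.sqrt(trace * trace - 4.0 * det)) / 2.0
lam2 = det / lam1            # avoids cancellation in (trace - sqrt(...)) / 2
kappa2 = abs(lam1 / lam2)
```

The amplification comes out near 39175, as in the example. With unrounded eigenvalues ($\lambda_2 \approx -0.0000505$) the computed $\kappa_2(A)$ is about $3.92 \times 10^4$, the same size as the figure quoted in the example, which is based on the rounded value $\lambda_2 \approx -0.00005$.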


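A classic example of an ill-conditioned matrix is the Hilbert matrix of order
$n$:

$$H_n = \begin{pmatrix}
1 & \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n} \\
\frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \cdots & \frac{1}{n+1} \\
\vdots & \vdots & \vdots & & \vdots \\
\frac{1}{n} & \frac{1}{n+1} & \frac{1}{n+2} & \cdots & \frac{1}{2n-1}
\end{pmatrix}.$$

Note that $H_n^T = H_n$ ($H_n$ is Hermitian). Condition numbers for some Hilbert
matrices appear in Table 3.1.

TABLE 3.1: Condition numbers of some Hilbert matrices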

REMARK 3.33 Consider $Ax = b$. Ill-conditioning combined with rounding
errors can have a disastrous effect in Gaussian elimination. Sometimes,
the conditioning can be improved ($\kappa$ decreased) by scaling the equations. A
common scaling strategy is to row equilibrate the matrix $A$ by choosing a
diagonal matrix $D$, such that premultiplying $A$ by $D$ causes $\max_{1 \le j \le n} |a_{ij}| = 1$
for $i = 1, 2, \cdots, n$. Thus, $DAx = Db$ becomes the scaled system with maximum
elements in each row of $DA$ equal to unity. (This procedure is generally
recommended before Gaussian elimination with partial pivoting is employed
[40]. However, there is no guarantee that equilibration with partial pivoting
will not suffer greatly from effects of roundoff error.)
3.3.5.2 Roundoff Error in Gaussian Elimination


Consider the solution of $Ax = b$. On a computer, elements of $A$ and $b$
are represented by floating point numbers. Solving this linear system on a
computer only produces an approximate solution $\hat{x}$.

There are two kinds of rounding error analysis. In backward error analysis,
one shows that the computed solution $\hat{x}$ is the exact solution of a perturbed
system of the form $(A + F)\hat{x} = b$. (See, for example, [68] or [100].) Then we
have

$$A\hat{x} - Ax = -F\hat{x},$$

that is,

$$\hat{x} - x = -A^{-1}F\hat{x},$$

from which we obtain

$$\frac{\|x - \hat{x}\|_\infty}{\|\hat{x}\|_\infty} \le \|A^{-1}\|_\infty\|F\|_\infty = \kappa_\infty(A)\frac{\|F\|_\infty}{\|A\|_\infty}. \tag{3.22}$$

Thus, assuming that we have estimates for $\kappa_\infty(A)$ and $\|F\|_\infty$, we can use
(3.22) to estimate the error $\|x - \hat{x}\|_\infty$.

In forward error analysis, one keeps track of roundoff error at each step of
the elimination procedure. Then, $x - \hat{x}$ is estimated in some norm in terms
of, for example, $A$, $\kappa(A)$, and $\theta = \frac{p}{2}\beta^{1-t}$ [88, 89].

The analyses are lengthy and are not given here. The results, however, are
useful to understand. Basically, it is shown that

$$\frac{\|F\|_\infty}{\|A\|_\infty} \le c_n g\theta, \tag{3.23}$$

where

$c_n$ is a constant that depends on the size of the $n \times n$ matrix $A$,

$g$ is a growth factor, $g = \dfrac{\max_{i,j,k} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|}$, and

$\theta$ is the unit roundoff error, $\theta = \frac{p}{2}\beta^{1-t}$.

Note: Using backward error analysis, $c_n = 1.01n^3 + 5(n + 1)^2$, and using
forward error analysis, $c_n = \frac{1}{6}(n^3 + 15n^2 + 2n - 12)$.


Note: The growth factor $g$ depends on the pivoting strategy: $g \le 2^{n-1}$ for
partial pivoting,¹⁰ while $g \le n^{1/2}(2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)})^{1/2}$ for full pivoting.
(Wilkinson conjectured that this can be improved to $g \le n$.) For example,
for $n = 100$, $g \le 2^{99} \approx 10^{30}$ for partial pivoting and $g \le 3300$ for full pivoting.

¹⁰ It cannot be improved, since $g = 2^{n-1}$ for certain matrices.

Note: Thus, by (3.22) and (3.23), the relative error $\|x - \hat{x}\|_\infty/\|\hat{x}\|_\infty$ depends
directly on $\kappa_\infty(A)$, $\theta$, $n^3$, and the pivoting strategy.

REMARK 3.34 The factor of $2^{n-1}$ discouraged numerical analysts in
the 1950's from using Gaussian elimination, and spurred study of iterative
methods for solving linear systems. However, it was found that, for most
matrices, the growth factor is much less, and Gaussian elimination with partial
pivoting is usually practical.


3.3.6 Iterative Refinement 


We now consider a technique called iterative refinement, which is sometimes
used to decrease the rounding error in Gaussian elimination. The following
assumption is made in iterative refinement: The solution to $Ax = b$ is computed
using Gaussian elimination with $t$ digits of precision, while the residual

$$r = b - Ax$$

is computed using $2t$ digits of precision. ($r$ is called the residual vector or,
simply, the residual.)


ALGORITHM 3.4

(Iterative refinement procedure)
INPUT: A matrix $A \in \mathbb{R}^{n \times n}$, a vector $b \in \mathbb{R}^n$, and a tolerance $\epsilon$.
OUTPUT: A refined approximation $x$ to $Ax = b$.

1. Compute an initial approximation $x^{(0)}$ to $Ax = b$ using Gaussian elimination
   with $t$-digit arithmetic. (All multiplier and interchange information
   is saved to speed up later Gaussian elimination calculations.)
2. $k \leftarrow 0$.
3. DO WHILE $\|x^{(k+1)} - x^{(k)}\| > \epsilon$.
   (a) $r^{(k)} \leftarrow b - Ax^{(k)}$. (Computed using $2t$-digit arithmetic.)
   (b) Solve $Ay^{(k)} = r^{(k)}$ for $y^{(k)}$, using $t$-digit arithmetic and using
       multiplier and interchange information from step 1.
   (c) $x^{(k+1)} \leftarrow x^{(k)} + y^{(k)}$.
   (d) $k \leftarrow k + 1$.
   END DO
4. RETURN $x^{(k)}$ as $x$.
END ALGORITHM 3.4.
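The following sketch illustrates the idea of Algorithm 3.4 under several simplifying assumptions, and all names are hypothetical. The $t$-digit solves are simulated with Python's `decimal` module, the residual and the update $x^{(k)} + y^{(k)}$ are evaluated exactly in rational arithmetic in place of $2t$-digit arithmetic, and each low-precision solve repeats the elimination instead of reusing the multipliers saved in step 1.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

def to_decimal(q):
    """Round an exact Fraction to the current Decimal precision."""
    return Decimal(q.numerator) / Decimal(q.denominator)

def low_precision_solve(A, b, digits):
    """Gaussian elimination with partial pivoting in `digits`-digit arithmetic."""
    n = len(A)
    with localcontext() as ctx:
        ctx.prec = digits
        M = [[to_decimal(v) for v in row] + [to_decimal(bi)]
             for row, bi in zip(A, b)]
        for k in range(n - 1):
            p = max(range(k, n), key=lambda r: abs(M[r][k]))
            M[k], M[p] = M[p], M[k]
            for i in range(k + 1, n):
                m = M[i][k] / M[k][k]
                M[i] = [vi - m * vk for vi, vk in zip(M[i], M[k])]
        x = [Decimal(0)] * n
        for k in range(n - 1, -1, -1):
            s = sum((M[k][j] * x[j] for j in range(k + 1, n)), Decimal(0))
            x[k] = (M[k][n] - s) / M[k][k]
    return [Fraction(v) for v in x]

def iterative_refinement(A, b, digits=4, tol=Fraction(1, 10**12), max_iter=50):
    """Refine a low-precision solution of A x = b using exact residuals."""
    x = low_precision_solve(A, b, digits)
    for _ in range(max_iter):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))   # exact residual
             for row, bi in zip(A, b)]
        y = low_precision_solve(A, r, digits)                # correction
        x = [xi + yi for xi, yi in zip(x, y)]
        if max(abs(yi) for yi in y) <= tol:
            break
    return x

A = [[Fraction(4), Fraction(1), Fraction(0)],
     [Fraction(1), Fraction(4), Fraction(1)],
     [Fraction(0), Fraction(1), Fraction(4)]]
x_true = [Fraction(1, 3), Fraction(2, 7), Fraction(1, 9)]
b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in A]

x0 = low_precision_solve(A, b, 4)
x_refined = iterative_refinement(A, b, digits=4)
err0 = max(abs(p - q) for p, q in zip(x0, x_true))
err = max(abs(p - q) for p, q in zip(x_refined, x_true))
```

For this well-conditioned matrix each pass shrinks the error by several orders of magnitude, so a handful of four-digit solves yields a result far more accurate than four digits.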


We now have the following question: using this procedure, does $x^{(k)} \to x$
as $k \to \infty$? In the analysis, it is assumed that $r^{(k)}$ is computed exactly, since $2t$
digits are used.


THEOREM 3.11 
Suppose that $A$ is nonsingular, $y^{(k)}$ is the exact solution to

$$(A + F_k)\,y^{(k)} = r^{(k)},$$

and $r^{(k)} = b - Ax^{(k)}$ is computed exactly. If there is a constant $\gamma > 0$ for
which

$$\|F_k\|\,\|A^{-1}\| \le \gamma < 1/2$$

for $k = 0, 1, 2, \ldots$, then the $A + F_k$, $k = 0, 1, 2, \ldots$, are nonsingular, and
$x^{(k)} \to x$ as $k \to \infty$.
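PROOF Consider $A + F_k = A(I + A^{-1}F_k)$. We have

$$\|A^{-1}F_k\| \le \|A^{-1}\|\,\|F_k\| < 1/2.$$

Thus, $I + A^{-1}F_k$ is nonsingular,

$$(I + A^{-1}F_k)^{-1} = \sum_{j=0}^{\infty} (-A^{-1}F_k)^j,$$

and

$$\|(I + A^{-1}F_k)^{-1}\| \le \frac{1}{1 - \|A^{-1}F_k\|} \le \frac{1}{1 - \|A^{-1}\|\,\|F_k\|}.$$

Hence, $A + F_k$ is nonsingular. Now consider

$$\begin{aligned}
x^{(k+1)} = x^{(k)} + y^{(k)} &= x^{(k)} + (A + F_k)^{-1} r^{(k)} \\
&= x^{(k)} + (A + F_k)^{-1}(b - Ax^{(k)}) \\
&= (A + F_k)^{-1}\left[(A + F_k)x^{(k)} + b - Ax^{(k)}\right] \\
&= (A + F_k)^{-1}\left[F_k x^{(k)} + b\right].
\end{aligned}$$

Hence,

$$\begin{aligned}
x^{(k+1)} - x &= (A + F_k)^{-1}\left[F_k x^{(k)} - (A + F_k)x + b\right] \\
&= (A + F_k)^{-1} F_k (x^{(k)} - x).
\end{aligned}$$

Thus, $\|x^{(k+1)} - x\| \le \|(A + F_k)^{-1}F_k\|\,\|x^{(k)} - x\|$. However,

$$\begin{aligned}
\|(A + F_k)^{-1}F_k\| &\le \|(A + F_k)^{-1}\|\,\|F_k\| = \|(I + A^{-1}F_k)^{-1}A^{-1}\|\,\|F_k\| \\
&\le \frac{\|A^{-1}\|\,\|F_k\|}{1 - \|A^{-1}\|\,\|F_k\|} \\
&\le \frac{\gamma}{1 - \gamma} < 1, \quad \text{since } \gamma < \frac{1}{2}.
\end{aligned}$$

Thus,

$$\|x^{(k)} - x\| \le \left(\frac{\gamma}{1 - \gamma}\right)\|x^{(k-1)} - x\| \le \left(\frac{\gamma}{1 - \gamma}\right)^{k}\|x^{(0)} - x\| \to 0 \quad \text{as } k \to \infty.$$

Hence, $x^{(k)} \to x$ as $k \to \infty$.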


Note: In the roundoff error analysis for Gaussian elimination,

$$\|F_k\|_\infty \le \left[1.01 n^3 + 5(n + 1)^2\right] \|A\|_\infty\, g\, \delta.$$

Therefore, by Theorem 3.11, it is sufficient that $\|F_k\|_\infty < 1/(2\|A^{-1}\|_\infty)$,
giving the condition

$$\left[1.01 n^3 + 5(n + 1)^2\right] \kappa_\infty(A)\, g\, \delta < 1/2.$$


This inequality implies that iterative refinement will converge if $\delta$
is sufficiently small ($t$ large enough) and $n$, the condition number, and the
growth factor are not too large.


Example 3.9 
$$\begin{pmatrix} 3.3330 & 15920. & -10.333 \\ 2.2220 & 16.710 & 9.612 \\ 1.5611 & 5.1791 & 1.6852 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 15913. \\ 28.544 \\ 8.4254 \end{pmatrix}$$

For this problem, $\kappa_\infty(A) \approx 16000$ and the exact solution is $x = [1, 1, 1]^T$. Using
Gaussian elimination with $t = 5$ and $\beta = 10$ (5-digit decimal arithmetic),

$$x^{(0)} = [1.2001, 0.99991, 0.92538]^T.$$

In 10-digit arithmetic, one obtains

$$r^{(0)} = [-0.00518, 0.27413, -0.18616]^T.$$

In 5-digit arithmetic, solving $Ay^{(0)} = r^{(0)}$ gives

$$y^{(0)} = [-0.20008, 8.9987 \times 10^{-5}, 0.074607]^T.$$

Then,

$$x^{(1)} = x^{(0)} + y^{(0)} = [1.0000, 1.0000, 0.99999]^T$$

and

$$x^{(2)} = x^{(1)} + y^{(1)} = [1.0000, 1.0000, 1.0000]^T.$$
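Iterative refinement as in Algorithm 3.4 can be imitated in software. The following is a minimal sketch, not the procedure used to produce the numbers above: it assumes Python's `decimal` module as a stand-in for $t$-digit and $2t$-digit decimal arithmetic, and the names `gauss_solve`, `residual`, and `iterative_refinement` are ours. For brevity it repeats the elimination at every step instead of saving the multipliers, and it stops when the correction $y^{(k)}$ is small.

```python
from decimal import Context, Decimal

def gauss_solve(A, b, ctx):
    """Gaussian elimination with partial pivoting; ctx rounds every operation."""
    n = len(b)
    # Augmented matrix [A | b], rounded to the working precision.
    M = [[ctx.plus(v) for v in row] + [ctx.plus(bi)] for row, bi in zip(A, b)]
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))   # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = ctx.divide(M[i][k], M[k][k])               # multiplier
            for j in range(k, n + 1):
                M[i][j] = ctx.subtract(M[i][j], ctx.multiply(m, M[k][j]))
    x = [Decimal(0)] * n
    for i in reversed(range(n)):                           # back substitution
        s = M[i][n]
        for j in range(i + 1, n):
            s = ctx.subtract(s, ctx.multiply(M[i][j], x[j]))
        x[i] = ctx.divide(s, M[i][i])
    return x

def residual(A, b, x, ctx):
    """r = b - A x, with every operation rounded by ctx."""
    r = []
    for row, bi in zip(A, b):
        s = bi
        for a, xj in zip(row, x):
            s = ctx.subtract(s, ctx.multiply(a, xj))
        r.append(s)
    return r

def iterative_refinement(A, b, t=5, tol=Decimal("1e-4"), max_iter=20):
    low, high = Context(prec=t), Context(prec=2 * t)
    x = gauss_solve(A, b, low)                 # step 1: t-digit solve
    for _ in range(max_iter):
        r = residual(A, b, x, high)            # step 3(a): 2t-digit residual
        y = gauss_solve(A, r, low)             # step 3(b): t-digit correction
        x = [low.add(xi, yi) for xi, yi in zip(x, y)]   # step 3(c)
        if max(abs(yi) for yi in y) <= tol:
            break
    return x

# Data of Example 3.9.
A = [[Decimal("3.3330"), Decimal("15920"), Decimal("-10.333")],
     [Decimal("2.2220"), Decimal("16.710"), Decimal("9.612")],
     [Decimal("1.5611"), Decimal("5.1791"), Decimal("1.6852")]]
b = [Decimal("15913"), Decimal("28.544"), Decimal("8.4254")]
print(iterative_refinement(A, b))
```

The intermediate iterates depend on the order in which the rounded operations are carried out, so they need not agree digit for digit with those quoted in Example 3.9.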


3.3.7 Interval Bounds 


In many instances, it is practical to obtain rigorous bounds on the solution
$x$ to a linear system $Ax = b$. The algorithm is a modification of the general
Gaussian elimination algorithm (Algorithm 3.1) and back substitution
(Algorithm 3.2), as follows.


ALGORITHM 3.5 
(Interval bounds for the solution to a linear system) 


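INPUT: $A \in L(\mathbb{R}^n)$ and $b \in \mathbb{R}^n$.
OUTPUT: an interval vector $\boldsymbol{x}$ such that the exact solution to $Ax = b$ must
be within the bounds $\boldsymbol{x}$.


1. Use Algorithm 3.1 and Algorithm 3.2 (that is, Gaussian elimination with
back substitution, or any other technique) and floating point arithmetic
to compute an approximation $Y$ to $A^{-1}$.


2. Use interval arithmetic, with directed rounding, to compute interval
enclosures to $YA$ and $Yb$. That is,
(a) $\boldsymbol{A} \leftarrow YA$ (computed with interval arithmetic),
(b) $\boldsymbol{b} \leftarrow Yb$ (computed with interval arithmetic).


3. FOR $k = 1, 2, \ldots, n - 1$ (forward phase using interval arithmetic)


FOR $i = k + 1, \ldots, n$

(a) $\boldsymbol{m}_{ik} \leftarrow \boldsymbol{a}_{ik} / \boldsymbol{a}_{kk}$.

(b) $\boldsymbol{a}_{ik} \leftarrow [0, 0]$.

(c) FOR $j = k + 1, \ldots, n$


$\boldsymbol{a}_{ij} \leftarrow \boldsymbol{a}_{ij} - \boldsymbol{m}_{ik}\boldsymbol{a}_{kj}$.


END FOR
END FOR


END FOR
4. $\boldsymbol{x}_n \leftarrow \boldsymbol{b}_n / \boldsymbol{a}_{nn}$.
5. FOR $k = n - 1, n - 2, \ldots, 1$ (back substitution)


$\boldsymbol{x}_k \leftarrow \left(\boldsymbol{b}_k - \sum_{j=k+1}^{n} \boldsymbol{a}_{kj}\boldsymbol{x}_j\right) / \boldsymbol{a}_{kk}$.
END FOR
END ALGORITHM 3.5.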
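The forward and back substitution phases of Algorithm 3.5 can be sketched with a small hand-rolled interval type. This is an illustration under stated assumptions, not INTLAB or any library API: intervals are `(lo, hi)` tuples, outward rounding is imitated by widening each computed endpoint by one unit in the last place with `math.nextafter` (Python 3.9 or later), and the names `iadd`, `isub`, `imul`, `idiv`, and `interval_gauss` are ours. The preconditioning by $Y$ (steps 1 and 2) is omitted, the right-hand side is updated in the forward phase, and the input entries are taken to be exactly the given floating point numbers.

```python
import math

def _down(v):
    return math.nextafter(v, -math.inf)

def _up(v):
    return math.nextafter(v, math.inf)

def iadd(a, b):
    return (_down(a[0] + b[0]), _up(a[1] + b[1]))

def isub(a, b):
    return (_down(a[0] - b[1]), _up(a[1] - b[0]))

def imul(a, b):
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (_down(min(p)), _up(max(p)))

def idiv(a, b):
    if b[0] <= 0.0 <= b[1]:
        raise ZeroDivisionError("pivot interval contains zero")
    q = [a[0] / b[0], a[0] / b[1], a[1] / b[0], a[1] / b[1]]
    return (_down(min(q)), _up(max(q)))

def interval_gauss(A, b):
    """Interval Gaussian elimination without pivoting or preconditioning."""
    n = len(b)
    A = [[(float(v), float(v)) for v in row] for row in A]   # point intervals
    b = [(float(v), float(v)) for v in b]
    for k in range(n - 1):                                   # forward phase
        for i in range(k + 1, n):
            m = idiv(A[i][k], A[k][k])
            for j in range(k + 1, n):
                A[i][j] = isub(A[i][j], imul(m, A[k][j]))
            b[i] = isub(b[i], imul(m, b[k]))
    x = [None] * n
    for k in reversed(range(n)):                             # back substitution
        s = b[k]
        for j in range(k + 1, n):
            s = isub(s, imul(A[k][j], x[j]))
        x[k] = idiv(s, A[k][k])
    return x

# A small illustrative system whose exact solution is (1, 1).
print(interval_gauss([[2.0, 1.0], [-1.0, 3.0]], [3.0, 2.0]))
```

Widening by one unit in the last place is cruder than true directed rounding, but it still encloses the exact result of each operation, because a round-to-nearest result lies within half a unit of it.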


Note: We can explicitly set $\boldsymbol{a}_{ik}$ to zero without loss of mathematical rigor,
even though, using interval arithmetic, $\boldsymbol{a}_{ik} - \boldsymbol{m}_{ik}\boldsymbol{a}_{kk}$ may not be exactly $[0, 0]$.
In fact, this operation does not even need to be done, since we need not
reference $\boldsymbol{a}_{ik}$ in the back substitution process.


Note: Obtaining the rigorous bounds $\boldsymbol{x}$ in Algorithm 3.5 is more costly
than computing an approximate solution with floating point arithmetic using
Gaussian elimination with back substitution, because an approximate inverse
$Y$ must explicitly be computed to precondition the system. However, both
computations take $O(n^3)$ operations for general systems.


The following theorem clarifies why we may use Algorithm 3.5 to obtain 
mathematically rigorous bounds. 


THEOREM 3.12 
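Define the solution set to $\boldsymbol{A}x = \boldsymbol{b}$ to be

$$\Sigma(\boldsymbol{A}, \boldsymbol{b}) = \{x \mid \tilde{A}x = \tilde{b} \text{ for some } \tilde{A} \in \boldsymbol{A} \text{ and } \tilde{b} \in \boldsymbol{b}\}.$$

If $Ax^* = b$, then $x^* \in \Sigma(\boldsymbol{A}, \boldsymbol{b})$. Furthermore, if $\boldsymbol{x}$ is the output to
Algorithm 3.5, then $\Sigma(\boldsymbol{A}, \boldsymbol{b}) \subseteq \boldsymbol{x}$.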


For facts enabling a proof of Theorem 3.12, see [62] or other references on 
interval analysis. 


Example 3.10 


Let the matrix $A$ and the right-hand-side vector $b$ be as in Example 3.9 (on
page 129). We will use IEEE double precision (see Table 1.1 on page 19)
to compute $Y$, and we will use interval arithmetic based on IEEE double
precision.$^{11}$ Rounded to 14 decimal digits,$^{12}$ we obtain

$$Y \approx \begin{pmatrix}
-0.00012055643706 & -0.14988499865822 & 0.85417095741675 \\
0.00006278655296 & 0.00012125786211 & -0.00030664438576 \\
-0.00008128244868 & 0.13847464088044 & -0.19692507695527
\end{pmatrix}.$$

Using outward rounding in both the computation and the decimal display, we
obtain

$$\boldsymbol{A} \subseteq \begin{pmatrix}
[1.00000000000000, 1.00000000000000] & [-0.00000000000012, -0.00000000000011] & [-0.00000000000001, -0.00000000000000] \\
[0.00000000000000, 0.00000000000001] & [1.00000000000000, 1.00000000000001] & [-0.00000000000001, -0.00000000000000] \\
[0.00000000000000, 0.00000000000001] & [0.00000000000013, 0.00000000000014] & [0.99999999999999, 1.00000000000001]
\end{pmatrix}$$

and

$$\boldsymbol{b} \subseteq \begin{pmatrix}
[1.00000000000000, 1.00000000000001] \\
[0.99999999999988, 0.99999999999989] \\
[1.00000000000013, 1.00000000000014]
\end{pmatrix}.$$

Completing the remainder of Algorithm 3.5 then gives

$$x^* \in \boldsymbol{x} \subseteq \begin{pmatrix}
[0.99999999999999, 1.00000000000001] \\
[0.99999999999999, 1.00000000000001] \\
[0.99999999999999, 1.00000000000001]
\end{pmatrix}.$$


Note: There are various alternate ways of using interval arithmetic to obtain 
rigorous bounds on the solution set to linear systems of equations. Some of 
these are related mathematically to the interval Newton method introduced 
in §2.5 on page 54, while others are related to the iterative techniques we 
discuss later in this section. The effectiveness and practicality of a particular 
such technique depend on the condition of the system, and whether the entries 
in the matrix A and right hand side vector b are points to start, or whether
there are larger uncertainties in them (that is, whether or not these coefficients 
are wide or narrow intervals). A good theoretical reference is [62] and some 
additional practical detail is given in our monograph [44]. 

We now consider another direct method for computing the solution of a 
linear system Ax = b. 


3.3.8 Orthogonal Decomposition (QR Decomposition) 


This direct method for computing the solution of $Ax = b$ is based on orthogonal
decomposition, also known as the QR decomposition or QR factorization.
In addition to solving linear systems, the QR factorization is also useful in
least squares problems and eigenvalue computations. The goal of orthogonal
decomposition is to find $A = QR$, where $Q$ is orthogonal (i.e., $Q^TQ = I$) and
$R$ is upper triangular. Then $Ax = b$ has the form $QRx = b$, so $Rx = Q^Tb$,
and this can be rapidly solved using backsolving.


11 We used MATLAB for our actual computations, and we used the INTLAB toolbox for MATLAB
for the interval arithmetic.
12 as MATLAB displays it

We will assume in this section that A is a real matrix. 

There are several common ways of computing QR decompositions. 


3.3.8.1 QR Decomposition with Householder Transformations 


We will first show how to compute the QR decomposition with Householder 
transformations. 


DEFINITION 3.30 Let $u \in \mathbb{R}^n$ and suppose that $\|u\|_2^2 = 1 = u^Tu$. Then
the $n \times n$ matrix $U = I - 2uu^T$ is called a Householder transformation.


As proved in the following lemmas, Householder transformations have sev- 
eral interesting properties. 


LEMMA 3.1 

Let $U$ be a Householder transformation. Then $U^T = U$, $U^TU = I$ ($U$ is
orthogonal) and $U^2 = I$ ($U$ is involutory).

PROOF Clearly $U^T = U$. Then

$$U^TU = U^2 = (I - 2uu^T)^2 = I - 4uu^T + 4uu^Tuu^T = I,$$

since $u^Tu = 1$.

LEMMA 3.2
Let $u \in \mathbb{R}^n$, $u \ne 0$, and $\theta = \frac{1}{2}\|u\|_2^2$. Then

$$U = I - \frac{uu^T}{\theta}$$

is a Householder transformation.


PROOF Let $v = (1/\|u\|_2)\,u$, so $\|v\|_2 = 1$. Then,

$$U = I - (1/\theta)\,uu^T = I - 2vv^T,$$

so $U$ is a Householder transformation.


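The next lemma shows that Householder transformations have the important
capability of introducing zeros into vectors.

LEMMA 3.3
Let $x \in \mathbb{R}^n$, $x \ne 0$, and let $x_1$ be the first element of $x$. Let

$$\sigma = \mathrm{sgn}(x_1)\,\|x\|_2$$

(where $\mathrm{sgn}(0) = 1$), let

$$u = x + \sigma e_1, \quad \text{and} \quad \theta = \frac{1}{2}\|u\|_2^2.$$

Then

$$U = I - (1/\theta)\,uu^T$$

is a Householder transformation, and $Ux = -\sigma e_1$.


PROOF Note that $u = x + \sigma e_1 \ne 0$, since $x \ne -\sigma e_1$. Thus, by Lemma 3.2,
$U$ is a Householder transformation. Consider $x^Tx = \sigma^2$ and

$$\begin{aligned}
\theta = \frac{1}{2}\|u\|_2^2 = \frac{1}{2}u^Tu
&= \frac{1}{2}(x + \sigma e_1)^T(x + \sigma e_1)
 = \frac{1}{2}\left(x^Tx + \sigma x^Te_1 + \sigma e_1^Tx + \sigma^2 e_1^Te_1\right) \\
&= \frac{1}{2}\left(2\sigma^2 + 2\sigma x_1\right) = \sigma^2 + \sigma x_1.
\end{aligned}$$

Then,

$$\begin{aligned}
Ux = x - \frac{uu^Tx}{\theta} &= x - \frac{(x + \sigma e_1)(x + \sigma e_1)^Tx}{\sigma^2 + \sigma x_1} \\
&= x - (x + \sigma e_1)\,\frac{\sigma^2 + \sigma x_1}{\sigma^2 + \sigma x_1} = -\sigma e_1.
\end{aligned}$$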


To avoid problems with underflows and overflows in computation of U, x 
is scaled at the beginning of the computation. The algorithm proceeds as 
follows. 


ALGORITHM 3.6 

(Computing a Householder transformation.) 
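INPUT: $x \in \mathbb{R}^n$ with $x \ne 0$.
OUTPUT: $\sigma$, $\theta$, and $u$ such that $Ux = (I - uu^T/\theta)\,x = -\sigma e_1$.


1. $v \leftarrow x / \|x\|_\infty$ (i.e., $\|v\|_\infty = 1$).
2. $\tilde{\sigma} \leftarrow \mathrm{sgn}(v_1)\,\|v\|_2$.
3. $u_1 \leftarrow v_1 + \tilde{\sigma}$, $u_i \leftarrow v_i$ for $2 \le i \le n$.
4. $\theta \leftarrow \tilde{\sigma}u_1$. ($\theta = \tilde{\sigma}^2 + \tilde{\sigma}v_1$.)
5. $\sigma \leftarrow \tilde{\sigma}\,\|x\|_\infty$. (Note that $\sigma = \mathrm{sgn}(x_1)\,\|x\|_2$.)
END ALGORITHM 3.6.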
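The steps of Algorithm 3.6 translate almost line for line into code. The sketch below is ours (the function names `householder_vector` and `apply_reflector` are illustrative); it applies $U = I - uu^T/\theta$ to a vector without ever forming the matrix $U$.

```python
import math

def householder_vector(x):
    """Return (sigma, theta, u) with (I - u u^T / theta) x = -sigma e_1."""
    scale = max(abs(v) for v in x)                  # ||x||_inf (step 1)
    v = [xi / scale for xi in x]
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    sigma_t = math.copysign(norm_v, v[0])           # step 2, with sgn(0) = 1
    u = v[:]
    u[0] = v[0] + sigma_t                           # step 3
    theta = sigma_t * u[0]                          # step 4
    sigma = sigma_t * scale                         # step 5
    return sigma, theta, u

def apply_reflector(u, theta, x):
    """Compute U x = x - u (u^T x) / theta."""
    s = sum(ui * xi for ui, xi in zip(u, x)) / theta
    return [xi - s * ui for xi, ui in zip(x, u)]

sigma, theta, u = householder_vector([3.0, 4.0])
print(sigma, apply_reflector(u, theta, [3.0, 4.0]))   # sigma = 5, U x = (-5, 0)
```

Because $U$ is unchanged when $u$ is multiplied by a nonzero scalar, the vector $u$ built from the scaled $v$ defines the same reflector as the one built from $x$ itself.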


Steps 1 and 5 of Algorithm 3.6 are to avoid catastrophic underflows and 
overflows in the computation of ||v||2 in Step 2. See Example 1.16 on page 21. 
Householder transformations can also be defined for vectors $u \in \mathbb{C}^n$:


DEFINITION 3.31 Let $u \in \mathbb{C}^n$ be such that $\|u\|_2^2 = u^Hu = 1$ and define
$U = I - 2uu^H$. Then $U^H = U$ and $U^HU = I$. (Such a complex Householder
transformation $U$ is called a complex reflector.)


The following useful result for complex reflectors is analogous to Lemma 3.3, 
that is, it allows us to transform a complex vector to a unit vector. 


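LEMMA 3.4
Let $x \in \mathbb{C}^n$, $x \ne 0$, $x_1 = re^{i\varphi}$, and $\sigma = e^{i\varphi}\|x\|_2$ (with $\varphi = 0$ if $x_1 = 0$). Let
$u = x + \sigma e_1$, $\theta = \frac{1}{2}\|u\|_2^2$, and $U = I - uu^H/\theta$. Then $Ux = -\sigma e_1$.


PROOF Note that $x^Hx = \bar{\sigma}\sigma$ and

$$\begin{aligned}
\theta = \frac{1}{2}u^Hu &= \frac{1}{2}\left[x^H + \bar{\sigma}e_1^T\right]\left[x + \sigma e_1\right] \\
&= \frac{1}{2}\left[x^Hx + \sigma\bar{x}_1 + \bar{\sigma}x_1 + \bar{\sigma}\sigma\right] \\
&= \|x\|_2^2 + r\|x\|_2.
\end{aligned}$$

Now,

$$\begin{aligned}
Ux &= x - \frac{u}{\theta}\left[u^Hx\right] \\
&= x - \frac{x + \sigma e_1}{\theta}\left[x^Hx + \bar{\sigma}x_1\right] \\
&= x - (x + \sigma e_1)\,\frac{\|x\|_2^2 + r\|x\|_2}{\|x\|_2^2 + r\|x\|_2} \\
&= -\sigma e_1.
\end{aligned}$$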


Now consider the transformation of a matrix $A$ into upper trapezoidal form
using Householder transformations. It is assumed here that $A$ is $m \times n$, with
$m \ge n$. We will find a sequence of Householder transformations that will
transform $A$ into upper trapezoidal form, i.e., into the form $\begin{pmatrix} R \\ 0 \end{pmatrix}$, where $R$ is
an $n \times n$ upper triangular matrix and $0$ is an $(m - n) \times n$ matrix of zeros.


THEOREM 3.13 

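Let $A$ be a real $m \times n$ matrix with $m \ge n$. Then there exist Householder
transformations $U_1, U_2, \ldots, U_r$ such that $A^{(r+1)} = U_rU_{r-1}\cdots U_1A$ is upper
trapezoidal, where $r = \min(m - 1, n)$.


PROOF At the $k$-th stage, we construct a matrix $U_k$ such that multiplication
on the left by $U_k$ introduces zeros below the diagonal in the $k$-th
column. Let $A = A^{(1)} = [c^{(1)}, M_1]$, where $c^{(1)} \in \mathbb{R}^m$ and $M_1$ is an $m \times (n - 1)$
matrix. Using Algorithm 3.6, we construct a Householder transformation
$U_1 = I - u^{(1)}u^{(1)T}/\theta_1$ such that $U_1c^{(1)} = r_{11}(1, 0, \ldots, 0)^T \in \mathbb{R}^m$. Then

$$A^{(2)} = U_1A^{(1)} = \left(r_{11}e_1 \mid U_1M_1\right) = \left(\begin{array}{c|c} r_{11} & \\ 0 & U_1M_1 \end{array}\right).$$

We now suppose that at the $k$-th step we have

$$A^{(k)} = U_{k-1}U_{k-2}\cdots U_1A^{(1)} = \left(\begin{array}{c|c|c} R_k & r^{(k)} & B_k \\ \hline 0 & c^{(k)} & M_k \end{array}\right),$$

where $R_k$ is $(k - 1) \times (k - 1)$ upper triangular, $r^{(k)} \in \mathbb{R}^{k-1}$, $c^{(k)} \in \mathbb{R}^{m-k+1}$,
$B_k$ is $(k - 1) \times (n - k)$, $M_k$ is $(m - k + 1) \times (n - k)$, and "0" represents an
$(m - k + 1) \times (k - 1)$ matrix consisting entirely of zeros. Using Algorithm 3.6,
we find $u^{(k)} \in \mathbb{R}^{m-k+1}$ and $\theta_k$ such that the $(m - k + 1) \times (m - k + 1)$
matrices $\hat{U}_k = I - u^{(k)}u^{(k)T}/\theta_k$ are Householder transformations and $\hat{U}_kc^{(k)} =
r_{kk}(1, 0, \ldots, 0)^T \in \mathbb{R}^{m-k+1}$. We define

$$U_k = \left(\begin{array}{c|c} I_{k-1} & 0 \\ \hline 0 & \hat{U}_k \end{array}\right).$$

$U_k$ is a Householder transformation, since, if $\hat{U}_k = I_{m-k+1} - 2vv^T$ where
$\|v\|_2 = 1$, then $U_k = I_m - 2ww^T$, where $w = (0, \ldots, 0, v^T)^T \in \mathbb{R}^m$. Now,
using the above matrices,

$$A^{(k+1)} = U_kA^{(k)} = \left(\begin{array}{c|c|c} R_k & r^{(k)} & B_k \\ \hline 0 & r_{kk}e_1 & \hat{U}_kM_k \end{array}\right)
= \left(\begin{array}{c|c|c} R_{k+1} & r^{(k+1)} & B_{k+1} \\ \hline 0 & c^{(k+1)} & M_{k+1} \end{array}\right),$$

where $R_{k+1}$ is upper triangular, i.e.,

$$R_{k+1} = \left(\begin{array}{c|c} R_k & r^{(k)} \\ \hline 0 & r_{kk} \end{array}\right).$$

In general, we need $r - 1$ steps to reduce $A$ to upper triangular form, where
$r \le \min\{m, n\}$ is the rank of $A$.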


Note: To compute the QR factorization of an $m \times n$ matrix $A$, where $m \ge n$
and the rank of $A$ is $r = n$, we denote the product of the Householder
transformations by $Q$, that is, $Q = U_rU_{r-1}\cdots U_1$. We then partition the
upper trapezoidal matrix $A^{(r+1)}$ into

$$QA = A^{(r+1)} = \begin{pmatrix} R \\ 0 \end{pmatrix},$$

where $R$ is an $n \times n$ upper triangular matrix. Observe that $Q$ is orthogonal,
since

$$Q^TQ = U_1^TU_2^T\cdots U_r^TU_r\cdots U_1 = I.$$

We now partition the $m \times m$ matrix $Q^T$ into the form $Q^T = (Q_1 \mid Q_2)$, where
$Q_1$ is $m \times n$ and $Q_2$ is $m \times (m - n)$. Notice that $Q_1$ has orthonormal columns
because $Q^T$ has orthonormal columns. Since

$$A = Q^TA^{(r+1)} = (Q_1 \mid Q_2)\begin{pmatrix} R \\ 0 \end{pmatrix} = Q_1R,$$

$A = Q_1R$ is the desired QR factorization of matrix $A$. Notice that $R$ is an $n \times n$
upper triangular matrix and $Q_1$ is an $m \times n$ matrix with orthonormal columns.


Note: In the special case where $m = n$, then $A = QR$ where $Q$ and $R$ are
$n \times n$. Then, $Ax = b$ can be solved by backsolving $Rx = Q^Tb$. However,
this algorithm, which produces the QR factorization, requires $O(\frac{2}{3}n^3)$
multiplications and divisions. This method is, however, generally more stable with
respect to rounding errors [40].


3.3.8.2. QR Decomposition with Givens Rotations 


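The QR factorization can also be computed using Givens transformations,
which can introduce zeros at any position in a matrix. A sequence of Givens
transformations can thus be used to transform $A$ to upper triangular form,
i.e., $P_rP_{r-1}\cdots P_1A = R$, where $Q = \prod_{i=1}^{r} P_i$. To understand Givens
transformations (plane rotations), consider first $x \in \mathbb{R}^2$, and let

$$\gamma = \frac{x_1}{\|x\|_2} \quad \text{and} \quad \sigma = \frac{x_2}{\|x\|_2}.$$

Then

$$\begin{pmatrix} \gamma & \sigma \\ -\sigma & \gamma \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \|x\|_2 \\ 0 \end{pmatrix}.$$

To introduce a zero into the $j$-th location of a vector $x$ of length $n$, let
$P_{ij}$ be the $n \times n$ identity matrix, except for the four entries

$$(P_{ij})_{ii} = \gamma, \quad (P_{ij})_{ij} = \sigma, \quad (P_{ij})_{ji} = -\sigma, \quad (P_{ij})_{jj} = \gamma,$$

where

$$\gamma = \frac{x_i}{\sqrt{x_i^2 + x_j^2}} \quad \text{and} \quad \sigma = \frac{x_j}{\sqrt{x_i^2 + x_j^2}}.$$

Then it is straightforward to show that $P_{ij}^TP_{ij} = I$ ($P_{ij}$ is orthogonal), and

$$P_{ij}x = \left(x_1, \ldots, x_{i-1}, \sqrt{x_i^2 + x_j^2}, x_{i+1}, \ldots, x_{j-1}, 0, x_{j+1}, \ldots, x_n\right)^T.$$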
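A Givens rotation touches only two components of a vector, so it can be applied without forming an $n \times n$ matrix. The sketch below is ours (the name `givens_zero` and the zero-based indices are illustrative choices); it rotates the pair $(x_i, x_j)$ so that the $j$-th entry becomes zero and the $i$-th entry becomes $\sqrt{x_i^2 + x_j^2}$.

```python
import math

def givens_zero(x, i, j):
    """Return P_ij x, where the rotation in the (i, j) plane zeros entry j."""
    r = math.hypot(x[i], x[j])          # sqrt(x_i^2 + x_j^2) without overflow
    if r == 0.0:
        return list(x)                  # nothing to rotate
    gamma, sigma = x[i] / r, x[j] / r
    y = list(x)
    y[i] = gamma * x[i] + sigma * x[j]  # equals r
    y[j] = -sigma * x[i] + gamma * x[j] # equals 0 up to rounding
    return y

print(givens_zero([3.0, 7.0, 4.0], 0, 2))   # approximately [5, 7, 0]
```

All other entries are left unchanged, which is why Givens rotations are convenient when only a few entries of a matrix need to be eliminated.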


3.3.8.3 QR Decomposition with the Gram—Schmidt Procedure 




A third way to obtain the QR factorization is with a Gram–Schmidt
orthogonalization procedure. Let $A = (a_1, a_2, \ldots, a_n)$, where $a_i \in \mathbb{R}^m$ for
$i = 1, 2, \ldots, n$ and $m \ge n$, and assume that $a_1, \ldots, a_n$ are linearly
independent. In the Gram–Schmidt process, let

$$q_1 = a_1, \qquad q_k = a_k - \sum_{j=1}^{k-1} \alpha_{jk}q_j, \quad \alpha_{jk} = \frac{(q_j, a_k)}{(q_j, q_j)}, \quad k = 2, 3, \ldots, n.$$

Then the $q_i$, $i = 1, 2, \ldots, n$, are orthogonal. To see why, consider

$$q_2 = a_2 - \alpha_{12}q_1, \quad \alpha_{12} = \frac{(q_1, a_2)}{(q_1, q_1)}.$$

Then $(q_1, q_2) = (q_1, a_2) - \alpha_{12}(q_1, q_1) = 0$. Also, $q_2 \ne 0$ because $a_2$ is
independent from $a_1$. Continuing in this way, it can be verified by induction that the
entire set of $q_i$ is orthogonal.
Now notice that

$$\begin{aligned}
q_1 &= a_1 \\
q_2 &= a_2 - \alpha_{12}q_1 \\
q_3 &= a_3 - \alpha_{13}q_1 - \alpha_{23}q_2
\end{aligned}$$

so we can write

$$q_k = \sum_{j=1}^{k} t_{jk}a_j, \quad k = 1, 2, \ldots, n. \tag{3.24}$$

Now consider $A = QR = [q_1, q_2, \ldots, q_n]R$, so $AR^{-1} = Q$. However, writing
(3.24) in matrix form, we obtain

$$(a_1, a_2, \ldots, a_n)\begin{pmatrix} t_{11} & t_{12} & \cdots & t_{1n} \\ & t_{22} & \cdots & t_{2n} \\ & & \ddots & \vdots \\ & & & t_{nn} \end{pmatrix} = (q_1, q_2, \ldots, q_n),$$

which gives $t_{11}a_1 = q_1$, $t_{12}a_1 + t_{22}a_2 = q_2$, .... This shows that $R^{-1}$ can be
chosen to have entries that are the coefficients $t_{jk}$ in (3.24), and
thus that the Gram–Schmidt process is equivalent to finding a QR decomposition
of $A$. However, we see that

$$Q^TQ = \mathrm{diag}\left(q_1^Tq_1, q_2^Tq_2, \ldots, q_n^Tq_n\right) = \Lambda \ne I.$$

Nonetheless, setting $\hat{Q} = Q\Lambda^{-1/2}$ and $\hat{R} = \Lambda^{1/2}R$, we obtain

$$\hat{Q}\hat{R} = Q\Lambda^{-1/2}\Lambda^{1/2}R = A, \quad \text{where } \hat{Q}^T\hat{Q} = \Lambda^{-1/2}Q^TQ\Lambda^{-1/2} = I.$$


We treat the Gram—Schmidt process somewhat more formally, in the con- 
text of general inner product spaces, in Section 4.2.3 on page 201. 


REMARK 3.35 The Gram–Schmidt process as we have just explained
it is numerically unstable. However, it can be modified to take account of the
effects of roundoff error. This modified Gram–Schmidt process is sometimes
used in practice [78].
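The loss of orthogonality mentioned in Remark 3.35 is easy to observe. The sketch below is ours and makes two assumptions: the "modified" variant is taken to be the usual one, in which each coefficient is computed from the partially reduced vector rather than from the original column, and the test matrix is a standard nearly rank-deficient example with a small parameter `eps` that we chose for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(cols, modified=False):
    """Orthogonalize columns (unnormalized q_k, as in the text)."""
    qs = []
    for a in cols:
        v = list(a)
        for q in qs:
            w = v if modified else a           # modified: use running remainder
            alpha = dot(q, w) / dot(q, q)
            v = [vi - alpha * qi for vi, qi in zip(v, q)]
        qs.append(v)
    return qs

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

eps = 1e-8                                     # eps**2 is below double precision
cols = [[1.0, eps, 0.0, 0.0], [1.0, 0.0, eps, 0.0], [1.0, 0.0, 0.0, eps]]
q_classical = gram_schmidt(cols)
q_modified = gram_schmidt(cols, modified=True)
print(cosine(q_classical[1], q_classical[2]))  # about 0.5: far from orthogonal
print(cosine(q_modified[1], q_modified[2]))    # about 0
```

In exact arithmetic the two variants produce the same vectors; the difference arises only from the order in which rounding errors enter.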


3.3.8.4 Least Squares and the QR Decomposition 


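Overdetermined linear systems (with more equations than unknowns) occur
frequently in data fitting, in mathematical modeling and statistics. For
example, we may have data of the form $\{(t_i, y_i)\}_{i=1}^{m}$, and we wish to model
the dependence of $y$ on $t$ by a linear combination of $n$ basis functions $\{\varphi_j\}_{j=1}^{n}$,
that is,

$$y \approx f(t) = \sum_{i=1}^{n} x_i\varphi_i(t), \tag{3.25}$$

where $m > n$. Setting $f(t_i) = y_i$, $1 \le i \le m$, gives the overdetermined linear
system

$$\begin{pmatrix}
\varphi_1(t_1) & \varphi_2(t_1) & \cdots & \varphi_n(t_1) \\
\varphi_1(t_2) & \varphi_2(t_2) & \cdots & \varphi_n(t_2) \\
\vdots & \vdots & & \vdots \\
\varphi_1(t_m) & \varphi_2(t_m) & \cdots & \varphi_n(t_m)
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} =
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \tag{3.26}$$

that is,

$$Ax = b, \quad \text{where } A \in L(\mathbb{R}^n, \mathbb{R}^m),\ a_{ij} = \varphi_j(t_i), \text{ and } b_i = y_i. \tag{3.27}$$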


Perhaps the most common way of fitting data is with least squares, in which
we find $x^*$ such that

$$\frac{1}{2}\|Ax^* - b\|_2^2 = \min_{x \in \mathbb{R}^n} \phi(x), \quad \text{where } \phi(x) = \frac{1}{2}\|Ax - b\|_2^2. \tag{3.28}$$

(Note that $x^*$ minimizes the 2-norm of the residual vector $r(x) = Ax - b$,
since the function $g(u) = u^2$ is increasing for $u \ge 0$.)

The naive way of finding $x^*$ is to set the gradient $\nabla\phi(x) = 0$ and simplify.
Doing so gives the normal equations:

$$A^TAx = A^Tb. \tag{3.29}$$

(See Exercise 28 on page 183.) However, the normal equations tend to be
very ill-conditioned. For example, if $m = n$, $\kappa_2(A^TA) = \kappa_2(A)^2$. Fortunately,
the least squares solution $x^*$ may be computed with a QR decomposition. In
particular,

$$\|Ax - b\|_2 = \|QRx - b\|_2 = \|Q^T(QRx - b)\|_2 = \|Rx - Q^Tb\|_2.$$

(Above, we used $\|Ux\|_2 = \|x\|_2$ when $U$ is unitary; see Exercise 25 on page 182
below.) However,

$$\|Rx - Q^Tb\|_2^2 = \sum_{i=1}^{n}\left[\sum_{j=i}^{n} r_{ij}x_j - (Q^Tb)_i\right]^2 + \sum_{i=n+1}^{m}(Q^Tb)_i^2. \tag{3.30}$$

Observe now:


1. All $m$ terms in the sum in (3.30) are nonnegative.
2. The first $n$ terms can be made exactly zero.
3. The last $m - n$ terms are constant.


Therefore,

$$\min_{x \in \mathbb{R}^n}\|Ax - b\|_2^2 = \sum_{i=n+1}^{m}(Q^Tb)_i^2,$$

and the minimizer $x^*$ can be computed by backsolving the square triangular
system consisting of the first $n$ rows of $Rx = Q^Tb$.
We will say more about least squares approximations in Chapter 4, in Sec- 
tion 4.3.6. We now turn to iterative techniques for linear systems of equations. 
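The least squares procedure of Section 3.3.8.4 can be sketched by applying Householder reflectors to $A$ and $b$ simultaneously, so that $Q$ is never formed. The function name `householder_lstsq` and the three data points are ours, chosen for illustration; the scaling safeguard of Algorithm 3.6 is omitted for brevity, and $A$ is assumed to have full column rank.

```python
import math

def householder_lstsq(A, b):
    """Minimize ||Ax - b||_2; return (x, minimal residual norm)."""
    m, n = len(A), len(A[0])
    R = [[float(v) for v in row] for row in A]
    c = [float(v) for v in b]
    for k in range(n):
        col = [R[i][k] for i in range(k, m)]
        norm = math.sqrt(sum(v * v for v in col))
        if norm == 0.0:
            continue
        sigma = math.copysign(norm, col[0])
        u = col[:]
        u[0] += sigma
        theta = sigma * u[0]                        # = sigma^2 + sigma * x_1
        for j in range(k, n):                       # apply U to columns of R
            s = sum(u[i] * R[k + i][j] for i in range(m - k)) / theta
            for i in range(m - k):
                R[k + i][j] -= s * u[i]
        s = sum(u[i] * c[k + i] for i in range(m - k)) / theta
        for i in range(m - k):                      # apply U to right-hand side
            c[k + i] -= s * u[i]
    x = [0.0] * n
    for i in reversed(range(n)):                    # backsolve first n rows
        x[i] = (c[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    res = math.sqrt(sum(v * v for v in c[n:]))      # last m - n entries
    return x, res

# Fit y = x_1 + x_2 t to the points (0, 0), (1, 1), (2, 1).
x, res = householder_lstsq([[1, 0], [1, 1], [1, 2]], [0, 1, 1])
print(x, res)   # coefficients near (1/6, 1/2), residual norm near sqrt(6)/6
```

The residual norm comes for free from the last $m - n$ entries of the transformed right-hand side, in line with the three observations made about (3.30).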


3.4 Iterative Methods for Solving Linear Systems 
Here, we study iterative solution of linear systems

$$Ax = b, \quad \text{i.e.,} \quad \sum_{k=1}^{n} a_{jk}x_k = b_j, \quad j = 1, 2, \ldots, n. \tag{3.31}$$

Good references for iterative solution of linear systems are [49, 68, 96, 103].
Why may we wish to solve (3.31) iteratively? Suppose that $n = 10{,}000$ or
more, which is not unreasonable for many problems. Then $A$ has $10^8$ elements,
making it difficult to store or solve (3.31) directly using, for example, Gaussian
elimination. A simple way to obtain iterative methods is to split $A$ as

$$A = M - N \tag{3.32}$$

with $M$ nonsingular. Then (3.31) becomes

$$Mx = Nx + b. \tag{3.33}$$

Now suppose that it is "easy" to solve $My = q$ by a direct method, for
example, if $M$ is triangular. Then given $x^{(0)}$, we generate iterates by solving

$$Mx^{(k+1)} = Nx^{(k)} + b \quad \text{for } k = 0, 1, 2, \ldots \tag{3.34}$$

If $x^{(k)} \to x$ as $k \to \infty$, then (3.34) gives $Ax = b$, that is, the limit vector is
the solution to the original problem. Define the iteration matrix $B$ by

$$B = M^{-1}N \quad \text{and} \quad c = M^{-1}b. \tag{3.35}$$

Then (3.34) has the form

$$x^{(k+1)} = Bx^{(k)} + c, \quad k = 0, 1, 2, \ldots, \tag{3.36}$$

where $x^{(0)}$ is an initial guess. If $x^{(k)} \to x$ as $k \to \infty$, the solution $x$ satisfies

$$x = Bx + c,$$

so $x = M^{-1}Nx + c$, so $Mx = Nx + b$, so $Ax = b$.


Note: Iterates defined by (3.36) can be viewed as fixed point iterates that
under certain conditions converge to the fixed point.


DEFINITION 3.32 The iterative method defined by (3.36) is called convergent
if, for all initial values $x^{(0)}$, we have $x^{(k)} \to A^{-1}b$ as $k \to \infty$.


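Let $\epsilon^{(k)} = x^{(k)} - x$ for $k = 0, 1, 2, \ldots$ be the errors in the $k$-th iterate. Then,
since $x = Bx + c$ and $x^{(k+1)} = Bx^{(k)} + c$, we have

$$\epsilon^{(k+1)} = B\epsilon^{(k)}, \quad \text{that is,} \quad \epsilon^{(k+1)} = B^{k+1}\epsilon^{(0)} \quad \text{for } k = 0, 1, 2, \ldots \tag{3.37}$$

We see that $x^{(k)} \to x$ is equivalent to $\epsilon^{(k)} \to 0$. Thus, the iterative method is
convergent if and only if $B^k\epsilon^{(0)} \to 0$ as $k \to \infty$ for every $\epsilon^{(0)}$. We have the following result.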


THEOREM 3.14 
Let $x$ be the solution of $Ax = b$. The following are equivalent:


(a) the iterative method (3.36) is convergent, i.e., for all $x^{(0)}$ we have $x^{(k)} \to x$
as $k \to \infty$;


(b) the spectral radius obeys $\rho(B) < 1$;


(c) there exists a matrix norm $\|\cdot\|$ such that $\|B\| < 1$.


PROOF Recall first that Theorem 3.5 on page 102 states that the following
are equivalent:


(a) $\lim_{k \to \infty} A^k = 0$.
(b) $\lim_{k \to \infty} A^ky = 0 \ \ \forall y \in \mathbb{C}^n$.
(c) $\rho(A) < 1$.
(d) There exists a matrix norm $\|\cdot\|$ such that $\|A\| < 1$.


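We will show (a) $\Rightarrow$ (b) $\Rightarrow$ (c) $\Rightarrow$ (a). Let the iterative method (3.36) be
convergent, and let $y$ be a given vector. Let $x^{(0)} = y + x$ where $x = A^{-1}b$.
Since (3.36) is convergent, $B^ky \to 0$ as $k \to \infty$ for any $y$. Hence, applying
Theorem 3.5 gives (a) $\Rightarrow$ (b) $\Rightarrow$ (c). Finally, suppose that (c) is given; then
$\|\epsilon^{(k+1)}\| \le \|B\|^{k+1}\|\epsilon^{(0)}\| \to 0$, so (c) $\Rightarrow$ (a).

We study first three basic iterative methods: Jacobi, Gauss–Seidel, and
SOR. Given an $n \times n$ matrix $A$, we can split $A$ into $A = L + D + U$, where $L$
is strictly lower triangular, $D$ is diagonal, and $U$ is strictly upper triangular.
Specifically,

$$L = \begin{pmatrix} 0 & \cdots & \cdots & 0 \\ a_{21} & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ a_{n1} & \cdots & a_{n,n-1} & 0 \end{pmatrix}, \quad
U = \begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ \vdots & \ddots & \ddots & \vdots \\ \vdots & & \ddots & a_{n-1,n} \\ 0 & \cdots & \cdots & 0 \end{pmatrix}, \tag{3.38}$$

and $D = \mathrm{diag}(a_{11}, a_{22}, \ldots, a_{nn})$.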
3.4.1 The Jacobi Method 
We first consider the Jacobi, or total step, method. Let

$$M = D \quad \text{and} \quad N = -(L + U), \tag{3.39}$$

where $M$ and $N$ are as in (3.34), and $L$, $D$, and $U$ are as in (3.38). If $a_{ii} \ne 0$
for $i = 1, 2, \ldots, n$, then $M$ is nonsingular, and $B = M^{-1}N$ has the form

$$B = -D^{-1}(L + U) \equiv J. \tag{3.40}$$

$J$ is called the iteration matrix for the "point" Jacobi method (versus the
iteration matrix for the "block" Jacobi method). The iterative method becomes:

$$x^{(k+1)} = -D^{-1}(L + U)x^{(k)} + D^{-1}b, \quad k = 0, 1, 2, \ldots \tag{3.41}$$

Generally, one uses the following equations to solve for $x^{(k+1)}$:

$$x_i^{(0)} \text{ is given}, \qquad
x_i^{(k+1)} = \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\right), \tag{3.42}$$

for $k \ge 0$ and $1 \le i \le n$. Equations (3.42) are easily programmed.


Note: One can think of the Jacobi method as simply solving the $i$-th equation
of the system $Ax = b$ for the $i$-th variable, and substituting the old values of
the other variables to get new values of the $i$-th variable.
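A direct transcription of the componentwise Jacobi update might look as follows. This is a minimal sketch: the function name `jacobi`, the stopping test on the size of the update, and the small strictly diagonally dominant test system are our choices, not part of the text.

```python
def jacobi(A, b, x0, tol=1e-10, max_iter=500):
    """Jacobi iteration; returns (approximate solution, iterations used)."""
    n = len(b)
    x = list(x0)
    for k in range(1, max_iter + 1):
        # Every component of the new iterate uses only the old iterate x.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if max(abs(p - q) for p, q in zip(x_new, x)) <= tol:
            return x_new, k
        x = x_new
    return x, max_iter

# Illustrative system with exact solution (1, 1).
x, k = jacobi([[4.0, 1.0], [1.0, 3.0]], [5.0, 4.0], [0.0, 0.0])
print(x, k)
```

Note that the whole old iterate must be kept until the new one is complete, which is the storage cost discussed for the Jacobi method.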


3.4.2. The Gauss—Seidel Method 


We now discuss the Gauss–Seidel method, or successive relaxation method.
If in the Jacobi method, we use the new values of $x_j$ as they become available,
then

$$x_i^{(0)} \text{ is given}, \qquad
x_i^{(k+1)} = \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\right), \tag{3.43}$$

for $k \ge 0$ and $1 \le i \le n$. (We continue to assume that $a_{ii} \ne 0$ for $i =
1, 2, \ldots, n$.) The iterative method (3.43) is called the Gauss–Seidel method,
and can be obtained in matrix form by letting $M = L + D$ and $N = -U$.
Then

$$B = M^{-1}N = -(L + D)^{-1}U \equiv G,$$

and

$$x^{(k+1)} = -(L + D)^{-1}Ux^{(k)} + (L + D)^{-1}b \quad \text{for } k \ge 0. \tag{3.44}$$

The matrix $G$ is called the (point) Gauss–Seidel matrix.


Note: The Gauss–Seidel method only requires storage of

$$\left[x_1^{(k+1)}, x_2^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, x_i^{(k)}, x_{i+1}^{(k)}, \ldots, x_n^{(k)}\right]^T$$

to compute $x_i^{(k+1)}$. The Jacobi method requires storage of $x^{(k)}$ as well as
$x^{(k+1)}$. Also, the Gauss–Seidel method generally converges faster. For these
reasons, the Jacobi method is seldom used in practice.$^{13}$


Example 3.11 


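$$\begin{pmatrix} 2 & 1 \\ -1 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 2 \end{pmatrix}, \quad \text{i.e.,} \quad \begin{aligned} 2x_1 + x_2 &= 3 \\ -x_1 + 3x_2 &= 2. \end{aligned}$$

(The exact solution is $x_1 = x_2 = 1$.) The Jacobi and Gauss–Seidel methods
have the forms

$$\text{Jacobi:} \quad \begin{cases} x_1^{(k+1)} = \dfrac{3}{2} - \dfrac{1}{2}x_2^{(k)} \\[2mm] x_2^{(k+1)} = \dfrac{2}{3} + \dfrac{1}{3}x_1^{(k)} \end{cases}
\qquad
\text{Gauss–Seidel:} \quad \begin{cases} x_1^{(k+1)} = \dfrac{3}{2} - \dfrac{1}{2}x_2^{(k)} \\[2mm] x_2^{(k+1)} = \dfrac{2}{3} + \dfrac{1}{3}x_1^{(k+1)} \end{cases}$$

The results in Table 3.2 are obtained with $x^{(0)} = (0, 0)^T$. Observe that the
Gauss–Seidel method converges roughly twice as fast as the Jacobi method.
This behavior is provable.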
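The factor of two can be checked numerically. In the sketch below, the system is taken to be $2x_1 + x_2 = 3$, $-x_1 + 3x_2 = 2$ with exact solution $(1, 1)$, which is our reading of the data of Example 3.11; the helper names `sweep` and `iterations_to_tol` and the tolerance are illustrative choices.

```python
def sweep(A, b, x, gauss_seidel):
    """One Jacobi sweep, or one Gauss-Seidel sweep if gauss_seidel is True."""
    n = len(b)
    old = list(x)
    new = list(x)
    for i in range(n):
        src = new if gauss_seidel else old     # Gauss-Seidel reuses new values
        s = sum(A[i][j] * src[j] for j in range(n) if j != i)
        new[i] = (b[i] - s) / A[i][i]
    return new

def iterations_to_tol(A, b, exact, gauss_seidel, tol=1e-6, max_iter=1000):
    """Number of sweeps, starting from zero, until the max error is <= tol."""
    x = [0.0] * len(b)
    for k in range(1, max_iter + 1):
        x = sweep(A, b, x, gauss_seidel)
        if max(abs(xi - ei) for xi, ei in zip(x, exact)) <= tol:
            return k
    return max_iter

A = [[2.0, 1.0], [-1.0, 3.0]]
b = [3.0, 2.0]
n_jacobi = iterations_to_tol(A, b, [1.0, 1.0], gauss_seidel=False)
n_seidel = iterations_to_tol(A, b, [1.0, 1.0], gauss_seidel=True)
print(n_jacobi, n_seidel)
```

For this matrix the spectral radii are $\rho(J) = 1/\sqrt{6}$ and $\rho(G) = 1/6 = \rho(J)^2$, so each Gauss–Seidel sweep reduces the error about as much as two Jacobi sweeps.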


13 Methods that are a kind of hybrid between Jacobi and Gauss–Seidel are implemented in
parallel processing environments, where separate processors are working simultaneously on
separate sets of indices $i$, and it may not be efficient to communicate new values as soon as
they are computed.


TABLE 3.2: Iterates of the Jacobi and
Gauss–Seidel methods, for Example 3.11


3.4.3 Successive Overrelaxation


We now describe Successive OverRelaxation (SOR). In the SOR method,
one computes $x_i^{(k+1)}$ to be a weighted mean of $x_i^{(k)}$ and the Gauss–Seidel
iterate for that element. Specifically, for $\sigma \ne 0$ a real parameter, the SOR
method is given by

$$x_i^{(0)} \text{ is given}, \qquad
x_i^{(k+1)} = (1 - \sigma)x_i^{(k)} + \frac{\sigma}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\right), \tag{3.45}$$

for $1 \le i \le n$ and for $k \ge 0$. The parameter $\sigma$ is called a relaxation factor. If
$\sigma < 1$, we call $\sigma$ an underrelaxation factor and if $\sigma > 1$, we call $\sigma$ an
overrelaxation factor. Note that if $\sigma = 1$, the Gauss–Seidel method is obtained.


Note: For certain classes of matrices and certain $\sigma$ between 1 and 2, the SOR
method converges faster than the Gauss–Seidel method.


We can write (3.45) in the matrix form:

$$\left(L + \frac{1}{\sigma}D\right)x^{(k+1)} = -\left[U + \left(1 - \frac{1}{\sigma}\right)D\right]x^{(k)} + b \tag{3.46}$$

for $k = 0, 1, 2, \ldots$, with $x^{(0)}$ given. Thus, letting $M = L + (1/\sigma)D$ and
$N = -[U + (1 - (1/\sigma))D]$, we see that (3.46) is of the form (3.34). Thus,

$$B = M^{-1}N = (\sigma L + D)^{-1}\left[(1 - \sigma)D - \sigma U\right] \equiv S_\sigma,$$

and

$$x^{(k+1)} = S_\sigma x^{(k)} + \left(L + \frac{1}{\sigma}D\right)^{-1}b. \tag{3.47}$$

The matrix $S_\sigma$ is called the SOR matrix. Note that $\sigma = 1$ gives $G$, the
Gauss–Seidel matrix.
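The componentwise SOR update differs from Gauss–Seidel only in the final weighting. The sketch below is ours: the function name `sor`, the stopping test, the symmetric positive definite tridiagonal test matrix, and the value 1.2 for the relaxation factor are illustrative choices rather than recommendations.

```python
def sor(A, b, x0, sigma, tol=1e-10, max_iter=10000):
    """SOR iteration; sigma = 1 reproduces Gauss-Seidel."""
    n = len(b)
    x = list(x0)
    for k in range(1, max_iter + 1):
        change = 0.0
        for i in range(n):
            # x[j] already holds the new value for j < i, the old one for j > i.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            xi = (1.0 - sigma) * x[i] + sigma * (b[i] - s) / A[i][i]
            change = max(change, abs(xi - x[i]))
            x[i] = xi
        if change <= tol:
            return x, k
    return x, max_iter

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]                       # exact solution (1, 1, 1)
x_gs, k_gs = sor(A, b, [0.0, 0.0, 0.0], sigma=1.0)
x_sor, k_sor = sor(A, b, [0.0, 0.0, 0.0], sigma=1.2)
print(k_gs, k_sor)
```

Since the iterate is overwritten in place, only one vector needs to be stored, as for the Gauss–Seidel method.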


3.4.4 Convergence of the SOR, Jacobi, and Gauss—Seidel 
Methods 


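We present a theorem for convergence of the SOR method, then additional
theorems for the Jacobi and Gauss–Seidel methods. Consider first the general
iteration equations (3.36) and (3.37) (on page 141). We have

$$x^{(k+1)} - x^{(k)} = B\left(x^{(k)} - x^{(k-1)}\right), \tag{3.48}$$

since

$$x^{(k+1)} = Bx^{(k)} + c \quad \text{and} \quad x^{(k)} = Bx^{(k-1)} + c.$$

Also,

$$(I - B)\left(x^{(k)} - x\right) = x^{(k)} - x^{(k+1)} \tag{3.49}$$

because

$$x = Bx + c \quad \text{and} \quad x^{(k+1)} = Bx^{(k)} + c.$$

Thus,

$$x^{(k)} - x = -(I - B)^{-1}B\left(x^{(k)} - x^{(k-1)}\right) = -(I - B)^{-1}B^k\left(x^{(1)} - x^{(0)}\right),$$

which leads to the error estimates

$$\|x^{(k)} - x\| \le \frac{\|B\|}{1 - \|B\|}\,\|x^{(k)} - x^{(k-1)}\| \tag{3.50}$$

and

$$\|x^{(k)} - x\| \le \|(I - B)^{-1}\|\,\|B\|^k\,\|x^{(1)} - x^{(0)}\| \le \frac{\|B\|^k}{1 - \|B\|}\,\|x^{(1)} - x^{(0)}\|, \tag{3.51}$$

assuming, of course, $\|B\| < 1$.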


REMARK 3.36 The iteration equation (3.4) (on page 142) corresponds
to the iteration equation $x_{k+1} = g(x_k)$ for the fixed point method in the
contraction mapping theorem (Theorem 2.3 on page 40), while equations (3.50)
and (3.51) correspond to the error bounds (2.3) and (2.2) in the conclusion
to Theorem 2.3, respectively. The proofs of the error estimates (3.50) and
(3.51) are completely analogous to the proofs in Theorem 2.3. Further, we
will see a nonlinear multidimensional version of the contraction mapping
theorem, Theorem 8.2 on page 442, when we study the numerical solution of
systems of nonlinear equations. The theorem even generalizes to the solution
of integral equations; the proofs are analogous, with norms defined on the
function spaces associated with the integral equations.$^{14}$


REMARK 3.37 These error estimates are only helpful to the extent that
a norm $\|\cdot\|$ can be found such that $\|B\| < 1$. Three natural possible norms
are

$$\|B\|_\infty = \max_j \sum_{k=1}^{n} |b_{jk}|, \quad \|B\|_1 = \max_k \sum_{j=1}^{n} |b_{jk}|, \quad \text{or} \quad \|B\|_2 = \sqrt{\rho(B^TB)}.$$

In the theorems to follow, we will use these norms (and $\|\cdot\|_2$ in particular).
Also, the proofs sometimes take account of special properties of each method.


We will now prove a well-known convergence theorem for the SOR method. 


THEOREM 3.15 
(Ostrowski–Reich) If $A$ is Hermitian positive definite and $0 < \sigma < 2$, then
the SOR method converges for any initial vector $x^{(0)}$.


We will need the following two lemmas. 


LEMMA 3.5 
(Stein's Theorem) If $A$ is Hermitian positive definite and $R$ is an $n \times n$ matrix
such that $A - R^HAR$ is positive definite, then $\rho(R) < 1$.


PROOF Let $\lambda$ be an eigenvalue of $R$ and $u \ne 0$ be a corresponding eigenvector.
Then $u^HAu$ and $u^H(A - R^HAR)u$ are real and positive. (Recall that
$u^HBu > 0$ whenever $B$ is positive definite and $u \ne 0$.) Thus, $u^HAu > u^HR^HARu =
(\lambda u)^HA(\lambda u) = |\lambda|^2u^HAu$. Hence, $|\lambda|^2 < 1$.


LEMMA 3.6 
Let $A$ be Hermitian positive definite and suppose that $A = B - C$, with $B$
nonsingular. Further suppose that $B + B^H - A$ is positive definite. Then
$\rho(B^{-1}C) < 1$.


14 We illustrate norms on function spaces in Section 4.2, starting on page 191.


PROOF By Stein's Theorem (Lemma 3.5), it is sufficient to show that

$$Q = A - (B^{-1}C)^HA(B^{-1}C)$$

is positive definite. Since $B^{-1}C = I - B^{-1}A$, we have

$$\begin{aligned}
Q &= A - (I - B^{-1}A)^HA(I - B^{-1}A) \\
&= A - \left(I - (B^{-1}A)^H\right)A\left(I - B^{-1}A\right) \\
&= A - A + (B^{-1}A)^HA - (B^{-1}A)^HA(B^{-1}A) + A(B^{-1}A) \\
&= (B^{-1}A)^HA(B^{-1}A)^{-1}(B^{-1}A) - (B^{-1}A)^HA(B^{-1}A) + A(B^{-1}A) \\
&= (B^{-1}A)^H\left[B - A + \left((B^{-1}A)^H\right)^{-1}A\right](B^{-1}A) \\
&= (B^{-1}A)^H\left[B - A + B^H\right](B^{-1}A).
\end{aligned}$$

But $B - A + B^H$ is positive definite, so $Q$ is positive definite. (Consider
$Q = E^HRE$ with $R$ positive definite and $E$ nonsingular. Then, with $z = Ey$,

$$y^HQy = y^HE^HREy = z^HRz > 0$$

if $z \ne 0$ and hence if $y \ne 0$, since $E$ is nonsingular.)
With these two lemmas, we now proceed to prove our theorem on convergence
of the SOR method.


PROOF (of Theorem 3.15) Note that $A = B - C$ with

$$B = \frac{1}{\sigma}(D + \sigma L) \quad \text{and} \quad C = \frac{1}{\sigma}\left[(1 - \sigma)D - \sigma U\right].$$

(Recall that $A = L + D + U$.) Also,

$$S_\sigma = B^{-1}C = (\sigma L + D)^{-1}\left[(1 - \sigma)D - \sigma U\right].$$

We need to show that $\rho(S_\sigma) = \rho(B^{-1}C) < 1$. By Lemma 3.6, if we show
that $B + B^H - A$ is positive definite, then we have proved that $\rho(S_\sigma) < 1$.
(Note that $B$ is nonsingular because $B$ is triangular with all positive diagonal elements. The
latter follows from $A$ being positive definite, i.e., $e_i^TAe_i = a_{ii} > 0$.) Consider
therefore

$$\begin{aligned}
B + B^H - A &= \frac{2}{\sigma}D + L + L^H - (L + D + U) \\
&= \left(\frac{2}{\sigma} - 1\right)D \quad (\text{since } L^H = U) \\
&= \frac{1}{\sigma}(2 - \sigma)D.
\end{aligned}$$

Since $0 < \sigma < 2$, $B + B^H - A$ is positive definite.$^{15}$


Note: By Theorem 3.15, if A is Hermitian positive definite (or symmetric 
positive definite), then the SOR method converges for 0 < o < 2, and, there- 
fore, the Gauss-Seidel method converges. However, the Jacobi method may 
not be convergent. 


Example 3.12 


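$$A = \begin{pmatrix} 1 & a & a \\ a & 1 & a \\ a & a & 1 \end{pmatrix}$$

$A$ is positive definite for $-\frac{1}{2} < a < 1$. However,

$$J = -D^{-1}(L + U) = \begin{pmatrix} 0 & -a & -a \\ -a & 0 & -a \\ -a & -a & 0 \end{pmatrix}$$

and $\rho(J) = |2a|$. Thus, the Jacobi method is only convergent for $-\frac{1}{2} < a < \frac{1}{2}$.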
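The contrast in Example 3.12 can be seen numerically. The following sketch is ours: it takes $a = 0.8$, for which $A$ is positive definite but $\rho(J) = 1.6$, chooses the right-hand side so that the exact solution is $(1, 1, 1)$, and starts both iterations from the zero vector; the names `iterate` and `max_error` are illustrative.

```python
def iterate(A, b, steps, gauss_seidel):
    """Run a fixed number of Jacobi or Gauss-Seidel sweeps from x = 0."""
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        old = list(x)
        for i in range(n):
            src = x if gauss_seidel else old
            s = sum(A[i][j] * src[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

def max_error(x):
    return max(abs(xi - 1.0) for xi in x)

a = 0.8
A = [[1.0, a, a], [a, 1.0, a], [a, a, 1.0]]
b = [1.0 + 2.0 * a] * 3                       # so that x = (1, 1, 1)
print(max_error(iterate(A, b, 30, gauss_seidel=False)))   # grows like 1.6**30
print(max_error(iterate(A, b, 200, gauss_seidel=True)))   # near rounding level
```

The initial error $-(1, 1, 1)^T$ is an eigenvector of $J$ for the eigenvalue $-2a$, so the Jacobi error is multiplied by $-1.6$ at every step, while the Gauss–Seidel iterates converge as Theorem 3.15 with $\sigma = 1$ guarantees.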


We now consider additional convergence results for the Jacobi and Gauss–Seidel
methods. We give the first result without proof.


THEOREM 3.16

(Stein–Rosenberg theorem) Let $J = -D^{-1}(L + U)$ be a nonnegative $n \times n$
iteration matrix for the Jacobi method, and let $G$ be the associated Gauss–Seidel
matrix. Then one and only one of the following is valid.


(i) $\rho(J) = \rho(G) = 0$
(ii) $0 < \rho(G) < \rho(J) < 1$
(iii) $1 = \rho(J) = \rho(G)$
(iv) $1 < \rho(J) < \rho(G)$


Thus, in this case, the Jacobi and Gauss–Seidel methods are either both
convergent or both divergent. If convergent, the Gauss–Seidel method is generally
faster ($0 < \rho(G) < \rho(J) < 1$). For a proof of this result, see [96].


15 Note that $x^HDx = \sum_{i=1}^{n} d_{ii}|x_i|^2 > 0$ if $x \ne 0$.


Example 3.13
—2 2 0 1 ‘ 
Let A = ( 3 3) and J = & a. Since p(J) = ./3/2 > 1, the Gauss— 


Seidel and Jacobi methods do not converge. 0 


The following define special classes of matrices related to convergence of 
iterative methods. 


DEFINITION 3.33 An $n \times n$ complex matrix $A = (a_{ij})$ is said to be
diagonally dominant if

$$|a_{ii}| \ge \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \equiv w_i \quad \text{for } 1 \le i \le n. \tag{3.52}$$

If for all $i$, $1 \le i \le n$, the above inequality is strict, then matrix $A$ is said to
be strictly diagonally dominant.
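Diagonal dominance is a row-by-row check and is simple to test in code. The function name `is_diagonally_dominant` and the sample matrices below are ours, given only as an illustration of the definition.

```python
def is_diagonally_dominant(A, strict=False):
    """Check |a_ii| >= sum_{j != i} |a_ij| (or > if strict) for every row."""
    for i, row in enumerate(A):
        off_diagonal = sum(abs(v) for j, v in enumerate(row) if j != i)
        diagonal = abs(row[i])
        if diagonal < off_diagonal or (strict and diagonal == off_diagonal):
            return False
    return True

print(is_diagonally_dominant([[2, 1], [-1, 3]], strict=True))          # True
T = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
print(is_diagonally_dominant(T), is_diagonally_dominant(T, strict=True))
```

The tridiagonal matrix `T` is diagonally dominant but not strictly so, because equality holds in its middle row; the first and last rows satisfy the inequality strictly.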


DEFINITION 3.34 For $n \ge 2$, an $n \times n$ complex matrix $A$ is called
reducible if there exists a permutation matrix $P$ such that

$$PAP^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix},$$

where $A_{11}$ is an $r \times r$ submatrix, $1 \le r < n$, and $A_{22}$ is an $(n - r) \times (n - r)$
submatrix. If no such permutation matrix exists, $A$ is called irreducible.


By Lemma 3.10 to be given later, a matrix A is irreducible if and only if its 
directed graph is strongly connected. Lemma 3.10 is generally much easier to 
use than Definition 3.34 in determining if a particular matrix is irreducible. 


DEFINITION 3.35 We say that a matrix $A$ is irreducibly diagonally
dominant if $A$ is irreducible, $A$ is diagonally dominant, and at least one of
the inequalities in (3.52) is strict.


THEOREM 3.17 

Let $A = (a_{ij})$ be an $n \times n$ complex matrix which is either strictly or
irreducibly diagonally dominant. Then the Jacobi and Gauss-Seidel methods are
convergent.
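As a small numerical illustration of Theorem 3.17, the following sketch checks strict diagonal dominance and then computes the spectral radii of the Jacobi and Gauss-Seidel iteration matrices. The helper names and the matrix `A_demo` are illustrative assumptions, not part of the text; NumPy is assumed to be available.

```python
import numpy as np

def is_strictly_diagonally_dominant(A):
    """True if |a_ii| > sum_{j != i} |a_ij| holds in every row."""
    A = np.asarray(A, dtype=float)
    diag = np.abs(np.diag(A))
    off_diag = np.sum(np.abs(A), axis=1) - diag
    return bool(np.all(diag > off_diag))

def iteration_matrices(A):
    """Return J = -D^{-1}(L+U) and G = -(D+L)^{-1}U for A = L + D + U."""
    A = np.asarray(A, dtype=float)
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    J = -np.linalg.solve(D, L + U)
    G = -np.linalg.solve(D + L, U)
    return J, G

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

# Hypothetical strictly diagonally dominant test matrix.
A_demo = np.array([[4.0, -1.0, 1.0],
                   [2.0, 5.0, -2.0],
                   [1.0, 1.0, 3.0]])
J_demo, G_demo = iteration_matrices(A_demo)
print(spectral_radius(J_demo), spectral_radius(G_demo))
```

For a strictly diagonally dominant matrix, the bound $\rho(J) \le \|J\|_\infty < 1$ used in the proof can be observed directly with `np.linalg.norm(J_demo, np.inf)`.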


To prove this, we need some additional results from matrix theory. 


LEMMA 3.7 
(Perron-Frobenius) Let $B \ge 0$ be an $n \times n$ irreducible matrix. Then

(i) $B$ has a positive real eigenvalue equal to its spectral radius;

(ii) there is an eigenvector $x > 0$ corresponding to $\lambda = \rho(B)$.


Lemma 3.7 is a classic result from matrix theory. 


LEMMA 3.8 
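Let $A$ and $B$ be $n \times n$ matrices with $0 \le |B| \le A$. Then $\rho(B) \le \rho(A)$.

(Here $|B| = (|b_{ij}|)$, where $B = (b_{ij})$.)


PROOF Let $\sigma = \rho(A)$. For any $\epsilon > 0$, set
$$B_1 = (\sigma + \epsilon)^{-1} B \quad \text{and} \quad A_1 = (\sigma + \epsilon)^{-1} A.$$
Then $|B_1| \le A_1$ and $\rho(A_1) < 1$, so
$$0 \le |B_1|^k \le A_1^k \to 0$$
as $k \to \infty$. Thus, $\rho(B_1) < 1$, so $\rho(B) < \sigma + \epsilon$. Since $\epsilon$ was arbitrary,
$\rho(B) \le \rho(A)$.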


LEMMA 3.9 
Let $A$ be strictly or irreducibly diagonally dominant. Then $A$ is nonsingular.


PROOF (partial) Let $A = D - B$ be the splitting of $A$ into its diagonal
and off-diagonal parts. If $A$ is strictly diagonally dominant, then $D$ is
nonsingular, and if $C = D^{-1}B$, then $\rho(C) \le \|C\|_\infty < 1$. Thus $I - C = D^{-1}A$
is nonsingular, so $A$ is nonsingular. For a proof of the nonsingularity of an
irreducibly diagonally dominant matrix, see [68].


We will now prove Theorem 3.17. 


PROOF (of Theorem 3.17) By Lemma 3.9, $A$ is invertible. Since $A$
is strictly or irreducibly diagonally dominant, we also must have $a_{ii} \ne 0$,
$1 \le i \le n$. (Otherwise, one or more rows of $A$ would be zero and $A$ would not
be invertible.) The elements of the iteration matrix $J = -D^{-1}(L+U) = (b_{ij})$
for the Jacobi method are given by

$$b_{ij} = \begin{cases} -a_{ij}/a_{ii}, & i \ne j, \\ 0, & i = j. \end{cases} \qquad (3.53)$$

If $A$ is strictly diagonally dominant, Definition 3.33 gives
$$\sum_{j=1}^{n} |b_{ij}| < 1, \quad 1 \le i \le n. \qquad (3.54)$$

Hence, $\|J\|_\infty < 1$ and the Jacobi method converges.
If $A$ is irreducibly diagonally dominant, Definition 3.35 gives

$$\sum_{j=1}^{n} |b_{ij}| \le 1, \quad 1 \le i \le n, \quad \text{and} \quad \sum_{j=1}^{n} |b_{kj}| < 1 \text{ for some } k. \qquad (3.55)$$
Now consider the nonnegative matrix $|B| = (|b_{ij}|)$, and let $\lambda > 0$ be its
eigenvalue equal to $\rho(|B|)$, whose existence is guaranteed by Lemma 3.7 (the
Perron-Frobenius theorem). Let $x > 0$ be a positive eigenvector associated
with $\lambda$, and let $x$ be normalized so that $x_p = \max_{1 \le i \le n} x_i = 1$. We then have

$$(|B|x)_p = \sum_{j=1}^{n} |b_{pj}| x_j = \lambda x_p = \lambda, \quad \text{so} \quad \rho(|B|) = \lambda = \sum_{j=1}^{n} |b_{pj}| x_j.$$

Let $S = \{m_1, m_2, \ldots, m_s\}$ be such that $x_{m_\ell} = 1$ for $\ell = 1, 2, \ldots, s$, and choose
$T = \{q_1, q_2, \ldots, q_t\}$ such that $x_{q_r} < 1$ for $r = 1, 2, \ldots, t$. We now have two
cases to consider:


1. Suppose $T$ is nonempty. Clearly, $t + s = n$ and $S \cup T = \{1, 2, \ldots, n\}$.
Suppose that $|b_{m_\ell q_r}| = 0$ for $\ell = 1, 2, \ldots, s$ and $r = 1, 2, \ldots, t$. Then by
Lemma 3.10 (on page 157 below), the iteration matrix for the Jacobi
method is reducible, and the same subsets apply to matrix $A$, implying
that $A$ is reducible. Thus, for some $r$ and $\ell$, $|b_{m_\ell q_r}| \ne 0$. We have then,
letting $p = m_\ell$,

$$\rho(|B|) = \sum_{j=1}^{n} |b_{m_\ell j}| x_j < \sum_{j=1}^{n} |b_{m_\ell j}| \le 1 \quad (\text{since } |b_{m_\ell q_r}| x_{q_r} < |b_{m_\ell q_r}|).$$

Thus, $\rho(|B|) < 1$.

2. $T$ is empty. Thus, $S = \{1, 2, \ldots, n\}$. Choose $p = k$, so

$$\rho(|B|) = \sum_{j=1}^{n} |b_{kj}| x_j = \sum_{j=1}^{n} |b_{kj}| < 1.$$

Thus, $\rho(|B|) < 1$.


Now, Lemma 3.8 yields $\rho(J) \le \rho(|B|)$, and thus, $\rho(J) < 1$.
For the Gauss-Seidel method, we have $\mathcal{G} = (I - G)^{-1}R$, where $\mathcal{G}$ is the
Gauss-Seidel iteration matrix, $G = -D^{-1}L$, and $R = -D^{-1}U$. Since $G$ is strictly
lower triangular, it is easily shown that $G^n = 0$. Therefore,
$\mathcal{G} = (I + G + G^2 + \cdots + G^{n-1})R$. Now it is also easily
shown that $|MN| \le |M||N|$. Thus,

$$|\mathcal{G}| \le (I + |G| + |G|^2 + \cdots + |G|^{n-1})|R| = (I - |G|)^{-1}|R|.$$

Then, by Lemma 3.8, $\rho(\mathcal{G}) \le \rho((I - |G|)^{-1}|R|)$. But $(I - |G|)^{-1}|R|$ is the
Gauss-Seidel iteration matrix associated with the nonnegative Jacobi iteration
matrix $|B| = |G| + |R|$. Therefore, from Lemma 3.8 and Theorem 3.16,
$$0 \le \rho(\mathcal{G}) \le \rho((I - |G|)^{-1}|R|) \le \rho(|B|) < 1.$$


Before continuing, it is interesting to note that the Jacobi method may con- 
verge where the Gauss-Seidel method diverges. (As indicated by the previous 
results, such a system is unusual.) 


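Example 3.14
Consider
$$\begin{aligned} x_1 + x_3 &= 0, \\ -x_1 + x_2 &= 0, \\ x_1 + 2x_2 - 3x_3 &= 0. \end{aligned}$$

For this system,

$$A = \begin{pmatrix} 1 & 0 & 1 \\ -1 & 1 & 0 \\ 1 & 2 & -3 \end{pmatrix}, \quad J = -D^{-1}(L+U) = \begin{pmatrix} 0 & 0 & -1 \\ 1 & 0 & 0 \\ 1/3 & 2/3 & 0 \end{pmatrix},$$
and
$$\mathcal{G} = -(D+L)^{-1}U = \begin{pmatrix} 0 & 0 & -1 \\ 0 & 0 & -1 \\ 0 & 0 & -1 \end{pmatrix}.$$

We have $\rho(\mathcal{G}) = 1$ but $\rho(J) < 1$ (Exercise 42 on page 186 below), so the Jacobi
method converges and the Gauss-Seidel method diverges.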
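The claim in Example 3.14 can be checked numerically. The sketch below builds $J$ and the Gauss-Seidel matrix from the splitting $A = L + D + U$ and computes their spectral radii; the variable names are illustrative and NumPy is assumed.

```python
import numpy as np

# Coefficient matrix of the system in Example 3.14.
A = np.array([[1.0, 0.0, 1.0],
              [-1.0, 1.0, 0.0],
              [1.0, 2.0, -3.0]])

D = np.diag(np.diag(A))   # diagonal part
L = np.tril(A, -1)        # strictly lower triangular part
U = np.triu(A, 1)         # strictly upper triangular part

J = -np.linalg.solve(D, L + U)    # Jacobi iteration matrix
G = -np.linalg.solve(D + L, U)    # Gauss-Seidel iteration matrix

rho_J = float(np.max(np.abs(np.linalg.eigvals(J))))
rho_G = float(np.max(np.abs(np.linalg.eigvals(G))))
print(rho_J, rho_G)
```

The Gauss-Seidel matrix here is upper triangular with diagonal $(0, 0, -1)$, which is why its spectral radius is exactly 1.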


3.4.5 The Interval Gauss—Seidel Method 


The interval Gauss-Seidel method is an alternate method^16 for using floating
point arithmetic to obtain mathematically rigorous lower and upper bounds
to the solution to a system of linear equations. The interval Gauss-Seidel
method has several advantages, especially when there are uncertainties in the
right-hand-side vector $b$ that are represented in the form of relatively wide
intervals $[\underline{b}_i, \overline{b}_i]$, and when there are also uncertainties $[\underline{a}_{ij}, \overline{a}_{ij}]$ in the
coefficients of the matrix $A$. That is, we assume that the matrix is $\boldsymbol{A} \in \mathbb{IR}^{n \times n}$,
$\boldsymbol{b} \in \mathbb{IR}^n$, and we wish to find an interval vector (or “box”) $\boldsymbol{x}$ that bounds

$$\Sigma(\boldsymbol{A}, \boldsymbol{b}) = \{x \mid Ax = b \text{ for some } A \in \boldsymbol{A} \text{ and some } b \in \boldsymbol{b}\}, \qquad (3.56)$$

where $\mathbb{IR}^{n \times n}$ denotes the set of all $n$ by $n$ matrices whose entries are
intervals, $\mathbb{IR}^n$ denotes the set of all $n$-vectors whose entries are intervals, and
$A \in \boldsymbol{A}$ means that each element of the point matrix $A$ is contained in the
corresponding element of the interval matrix $\boldsymbol{A}$ (and similarly for $b \in \boldsymbol{b}$).

^16 to the interval version of Gaussian elimination of Section 3.3.7 on page 130
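The interval Gauss-Seidel method is similar to the point Gauss-Seidel
method as defined in (3.43) on page 144, except that, for general systems,
we almost always precondition. In particular, let $\tilde{\boldsymbol{A}} = Y\boldsymbol{A}$ and $\tilde{\boldsymbol{b}} = Y\boldsymbol{b}$,
where $Y$ is a preconditioning matrix. We then have the preconditioned system

$$Y\boldsymbol{A}x = Y\boldsymbol{b}, \quad \text{i.e.,} \quad \tilde{\boldsymbol{A}}x = \tilde{\boldsymbol{b}}. \qquad (3.57)$$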


We have 


THEOREM 3.18 


(The solution set for the preconditioned system contains the solution set for
the original system.) $\Sigma(\boldsymbol{A}, \boldsymbol{b}) \subseteq \Sigma(Y\boldsymbol{A}, Y\boldsymbol{b}) = \Sigma(\tilde{\boldsymbol{A}}, \tilde{\boldsymbol{b}})$.


This theorem is a fairly straightforward consequence of the subdistributivity
(Equation (1.9) on page 25) of interval arithmetic. For a proof of this and
other facts concerning interval linear systems, see, for example, [62].

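Analogously to the noninterval version of Gauss-Seidel iteration (3.43), the
interval Gauss-Seidel method is given as

$$\boldsymbol{x}_i^{(k+1)} = \frac{1}{\tilde{\boldsymbol{a}}_{ii}} \left\{ \tilde{\boldsymbol{b}}_i - \sum_{j=1}^{i-1} \tilde{\boldsymbol{a}}_{ij} \boldsymbol{x}_j^{(k+1)} - \sum_{j=i+1}^{n} \tilde{\boldsymbol{a}}_{ij} \boldsymbol{x}_j^{(k)} \right\} \qquad (3.58)$$

for $i = 1, 2, \ldots, n$, where a sum is interpreted to be absent if its lower index
is greater than its upper index, and with $\boldsymbol{x}_i^{(0)}$ given for $i = 1, 2, \ldots, n$.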


REMARK 3.38 As with the interval version of Gaussian elimination
(Algorithm 3.5 on page 130), a common preconditioner $Y$ for the interval
Gauss-Seidel method is the inverse midpoint matrix $Y = (m(\boldsymbol{A}))^{-1}$, where
$m(\boldsymbol{A})$ is the matrix whose elements are midpoints of corresponding elements
of the interval matrix $\boldsymbol{A}$. However, when the elements of $\boldsymbol{A}$ have particularly
large widths, specially designed preconditioners^17 may be more appropriate.

^17 See [44, Chapter 3].

REMARK 3.39 Point iterative methods, including the Gauss-Seidel and
SOR methods and the conjugate gradient method explained in Section 3.4.10,
are usually preconditioned. Note, however, that computing an inverse of a
point matrix $A$ leads to $YA = I$, where $I$ is the identity matrix, so the
system will already have been solved (except for, possibly, iterative refinement).
Moreover, such point iterative methods are usually employed for very large
systems of equations, with matrices with “0” for many elements. Although
the elements that are 0 need not be stored, the inverse generally does not have
0’s in any of its elements [27], so it may be impractical to even store the
inverse, let alone compute it.^18 Thus, special approximations are used for these
preconditioners.^19 Preconditioners for the point Gauss-Seidel method,
conjugate gradient method, etc. are often viewed as operators that increase the
separation between the largest eigenvalue of $A$ and the remaining eigenvalues
of $A$, rather than computing an approximate inverse.


The following theorem tells us that the interval Gauss-Seidel method can 
be used to prove existence and uniqueness of a solution of a system of linear 
equations. 


THEOREM 3.19 

Suppose (3.58) is used, starting with initial interval vector $\boldsymbol{x}^{(0)}$, and obtaining
interval vector $\boldsymbol{x}^{(k)}$ after a number of iterations. Then, if $\boldsymbol{x}^{(k)} \subseteq \boldsymbol{x}^{(0)}$, for each
$A \in \boldsymbol{A}$ and each $b \in \boldsymbol{b}$, there is an $x \in \boldsymbol{x}^{(k)}$ such that $Ax = b$.


The proof of Theorem 3.19 can be found in many places, such as in [44] or
[62].


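Example 3.15
Consider $\boldsymbol{A}x = \boldsymbol{b}$, where

$$\boldsymbol{A} = \begin{pmatrix} [0.99, 1.01] & [1.99, 2.01] \\ [2.99, 3.01] & [3.99, 4.01] \end{pmatrix}, \quad \boldsymbol{b} = \begin{pmatrix} [-1.01, -0.99] \\ [0.99, 1.01] \end{pmatrix}, \quad \boldsymbol{x}^{(0)} = \begin{pmatrix} [-10, 10] \\ [-10, 10] \end{pmatrix}.$$

Then,^20

$$m(\boldsymbol{A}) = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad Y \approx m(\boldsymbol{A})^{-1} = \begin{pmatrix} -2.0 & 1.0 \\ 1.5 & -0.5 \end{pmatrix},$$

$$\tilde{\boldsymbol{A}} = Y\boldsymbol{A} \subseteq \begin{pmatrix} [0.97, 1.03] & [-0.03, 0.03] \\ [-0.02, 0.02] & [0.98, 1.02] \end{pmatrix}, \quad \tilde{\boldsymbol{b}} = Y\boldsymbol{b} \subseteq \begin{pmatrix} [2.97, 3.03] \\ [-2.02, -1.98] \end{pmatrix}.$$

We then have

$$\boldsymbol{x}_1^{(1)} \subseteq \frac{1}{[0.97, 1.03]} \Big( [2.97, 3.03] - [-0.03, 0.03][-10, 10] \Big) \subseteq [2.5922, 3.4330],$$

$$\boldsymbol{x}_2^{(1)} \subseteq \frac{1}{[0.98, 1.02]} \Big( [-2.02, -1.98] - [-0.02, 0.02][2.5922, 3.4330] \Big) \subseteq [-2.1313, -1.8738].$$

If we continue this process, we eventually obtain
$$\boldsymbol{x}^{(4)} = ([2.8215, 3.1895], [-2.1264, -1.8786])^T,$$
which, to four significant figures, is the same as $\boldsymbol{x}^{(3)}$. Thus, we have found
mathematically rigorous bounds on the set of all solutions to $Ax = b$ such
that $A \in \boldsymbol{A}$ and $b \in \boldsymbol{b}$.

^18 Of course, the inverse could be computed one row at a time, but this may still be
impractical for large systems.

^19 Much work has appeared in the research literature on such preconditioners.

^20 These computations were done with the aid of INTLAB, a MATLAB toolbox available free
of charge for non-commercial use.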
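The sweeps in Example 3.15 can be reproduced approximately with a few lines of interval arithmetic. The sketch below represents an interval as a `(lo, hi)` tuple and applies the Gauss-Seidel update to the preconditioned data $\tilde{\boldsymbol{A}}$, $\tilde{\boldsymbol{b}}$ quoted in the example. It is only an illustration under a loud assumption: ordinary floating point is used with no outward (directed) rounding, so the results are not mathematically rigorous the way INTLAB's are. All function names are hypothetical.

```python
def i_sub(a, b):
    """Interval subtraction a - b."""
    return (a[0] - b[1], a[1] - b[0])

def i_mul(a, b):
    """Interval multiplication a * b."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def i_div(a, b):
    """Interval division a / b, assuming 0 is not contained in b."""
    if b[0] <= 0.0 <= b[1]:
        raise ZeroDivisionError("divisor interval contains zero")
    return i_mul(a, (1.0 / b[1], 1.0 / b[0]))

def interval_gauss_seidel_sweep(A, b, x):
    """One sweep: components j < i use the already-updated intervals."""
    x = list(x)
    n = len(x)
    for i in range(n):
        acc = b[i]
        for j in range(n):
            if j != i:
                acc = i_sub(acc, i_mul(A[i][j], x[j]))
        x[i] = i_div(acc, A[i][i])
    return x

# Preconditioned data as printed in Example 3.15.
A_tilde = [[(0.97, 1.03), (-0.03, 0.03)],
           [(-0.02, 0.02), (0.98, 1.02)]]
b_tilde = [(2.97, 3.03), (-2.02, -1.98)]
x0 = [(-10.0, 10.0), (-10.0, 10.0)]

x1 = interval_gauss_seidel_sweep(A_tilde, b_tilde, x0)
x4 = x0
for _ in range(4):
    x4 = interval_gauss_seidel_sweep(A_tilde, b_tilde, x4)
print(x1)
print(x4)
```

A rigorous implementation would round lower bounds down and upper bounds up at every operation; that is what an interval package does and what this sketch omits.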


Note: In Example 3.15, uncertainties of +0.01 are present in each element of 
the matrix and right-hand-side vector. Although the bounds produced with 
the preconditioned interval Gauss-Seidel method are not guaranteed to be the 
tightest possible with these uncertainties, they will be closer to the tightest 
possible when the uncertainties are smaller. 

Convergence of the interval Gauss-Seidel method is related closely to con- 
vergence of the point Gauss-Seidel method, through the concept of diagonal 
dominance. We give a hint of this convergence theory here. 


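DEFINITION 3.36 If $\boldsymbol{a} = [\underline{a}, \overline{a}]$ is an interval, then the magnitude of $\boldsymbol{a}$
is defined to be

$$\mathrm{mag}(\boldsymbol{a}) = \max\{|\underline{a}|, |\overline{a}|\}.$$
Similarly, the mignitude of $\boldsymbol{a}$ is defined to be

$$\mathrm{mig}(\boldsymbol{a}) = \min_{a \in \boldsymbol{a}} |a|.$$


Given the matrix $\tilde{\boldsymbol{A}}$, form the matrix $H = (h_{ij})$ such that

$$h_{ij} = \begin{cases} \mathrm{mag}(\tilde{\boldsymbol{a}}_{ij}) & \text{if } i \ne j, \\ \mathrm{mig}(\tilde{\boldsymbol{a}}_{ij}) & \text{if } i = j. \end{cases}$$
Then, basically, the interval Gauss-Seidel method will be convergent if $H$ is
diagonally dominant.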
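To make Definition 3.36 concrete, the following sketch computes the magnitude and mignitude of intervals stored as `(lo, hi)` tuples, forms the matrix $H$, and tests strict row dominance for the preconditioned matrix printed in Example 3.15. The function names are illustrative assumptions.

```python
def mag(a):
    """Magnitude: the largest absolute value attained in the interval."""
    lo, hi = a
    return max(abs(lo), abs(hi))

def mig(a):
    """Mignitude: the smallest absolute value attained in the interval."""
    lo, hi = a
    if lo <= 0.0 <= hi:
        return 0.0
    return min(abs(lo), abs(hi))

def build_H(A):
    """h_ij = mag(a_ij) off the diagonal, mig(a_ii) on the diagonal."""
    n = len(A)
    return [[mig(A[i][j]) if i == j else mag(A[i][j]) for j in range(n)]
            for i in range(n)]

def rows_strictly_dominant(H):
    n = len(H)
    return all(H[i][i] > sum(H[i][j] for j in range(n) if j != i)
               for i in range(n))

A_tilde = [[(0.97, 1.03), (-0.03, 0.03)],
           [(-0.02, 0.02), (0.98, 1.02)]]
H = build_H(A_tilde)
print(H, rows_strictly_dominant(H))
```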
For a careful review of convergence theory for the interval Gauss-Seidel
method and other interval methods for linear systems, see [62]. Also, see [76].

3.4.6 Graph-Theoretic Properties of Matrices


Many large matrices occurring in practice are such that most of their entries 
are 0; such matrices are called sparse matrices. To make analysis of systems 
involving sparse matrices practical, we introduce some graph-theoretic con- 
cepts. We will also use these concepts in our analysis of the SOR method. 
Finally, we have already used Lemma 3.10 in this section in the proof of 
Theorem 3.17 (diagonal dominance implies convergence of the Jacobi and 
Gauss-Seidel methods). 


DEFINITION 3.37 Consider $n$ distinct points $P_1, P_2, \ldots, P_n$, which we
will call nodes, and consider a matrix $A \in L(\mathbb{R}^n)$. For every nonzero element
$a_{ij}$ of $A$ we construct a directed path from $P_i$ to $P_j$. Thus, we associate with
matrix $A$ a directed graph. We will call this the graph of the matrix.


DEFINITION 3.38 We say a directed graph is strongly connected if, for
any pair of nodes $P_i$, $P_j$ there is a path $P_iP_{\ell_1}, P_{\ell_1}P_{\ell_2}, \ldots, P_{\ell_r}P_j$ connecting
$P_i$ and $P_j$.


LEMMA 3.10 
A matrix $A$ is irreducible if and only if its directed graph is strongly connected.


(For a proof, see [68].)


Example 3.16

Let $A$ be a $4 \times 4$ matrix [entries not recoverable from the source],
$B = \begin{pmatrix} 0 & 1 & 0 \\ 3 & 0 & 4 \\ 0 & 2 & 0 \end{pmatrix}$, and $C = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 3 & 0 \\ 5 & 1 & 0 \end{pmatrix}$. Then $A$
and $B$ are irreducible, but $C$ is reducible. (See Figure 3.1, and also Exercise 44
on page 186.)

[Figure 3.1 shows three directed graphs: one on nodes $P_1, \ldots, P_4$ (irreducible), one on nodes $P_1, P_2, P_3$ (irreducible), and one on nodes $P_1, P_2, P_3$ (reducible).]

FIGURE 3.1: Directed graphs for irreducible and reducible matrices.

LEMMA 3.11

An $n \times n$ matrix $A = (a_{ij})$ is reducible if and only if there exist two nonempty
disjoint subsets $S$ and $T$ of $\{1, 2, \ldots, n\}$ such that $S \cup T = \{1, 2, \ldots, n\}$ and
such that if $i \in S$ and $j \in T$ then $a_{ij} = 0$.
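Lemma 3.10 suggests a simple computational test for irreducibility: build the directed graph of the matrix (an edge from $P_i$ to $P_j$ whenever $a_{ij} \ne 0$) and check that every node can reach every other node. The sketch below does this with a depth-first search. The function name is hypothetical, and the matrices `B` and `C` are the $3 \times 3$ matrices of Example 3.16 as best they can be read from the source.

```python
import numpy as np

def is_irreducible(A):
    """True if the directed graph of A is strongly connected."""
    A = np.asarray(A)
    n = A.shape[0]
    adjacency = (A != 0)

    def reachable_from(start):
        seen = {start}
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if adjacency[i, j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        return seen

    # Strongly connected: every node reaches all n nodes.
    return all(len(reachable_from(s)) == n for s in range(n))

B = np.array([[0, 1, 0],
              [3, 0, 4],
              [0, 2, 0]])
C = np.array([[0, 0, 1],
              [0, 3, 0],
              [5, 1, 0]])
print(is_irreducible(B), is_irreducible(C))
```

In `C`, node $P_2$ has an edge only to itself, so nothing else is reachable from it; this matches the subsets $S = \{2\}$, $T = \{1, 3\}$ in Lemma 3.11.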


Finally, to state conditions in the theory that follows, we need the following
two definitions.

[Figure 3.2 shows three cyclic directed graphs, labeled $p = 1$, $p = 2$ (two-cyclic), and $p = 4$.]

FIGURE 3.2: Cyclic diagrams.


DEFINITION 3.39 The matrix $A$ has property A (or is two-cyclic) if
there is a permutation matrix $P$ such that

$$PAP^T = \begin{pmatrix} D_1 & U \\ L & D_2 \end{pmatrix},$$

where $D_1$ and $D_2$ are diagonal. Similarly, we say a graph (or matrix) is cyclic
of index $p$ if its directed graph is strongly connected and the greatest common
divisor of all lengths of its closed paths is $p$. (See Figure 3.2.)


REMARK 3.40 It can be shown that a matrix $A$ is two-cyclic if the
directed graph of its associated Jacobi iteration matrix $J$ is a
cyclic graph of index 2, i.e., the greatest common divisor of all lengths of its
closed paths is 2.


DEFINITION 3.40 A matrix is consistently ordered provided the
vertices of its adjacency graph can be partitioned into $p$ sets $S_1, S_2, \ldots, S_p$ such
that any two adjacent vertices $P_i$ and $P_j$ belong to two consecutive partitions
$S_k$ and $S_{k'}$, with $k' = k - 1$ if $j < i$ and $k' = k + 1$ if $j > i$. (An equivalent
definition is that the eigenvalues of $B(\alpha) = \alpha^{-1}D^{-1}L + \alpha D^{-1}U$ are independent
of $\alpha$ for $\alpha \ne 0$.)

3.4.7 Convergence Rate of the SOR Method


We now have the following question: Can we find an optimum value for $\sigma$
in the SOR method? The following theorem indicates that we can.


THEOREM 3.20 

Let $A$ be a real $n \times n$ matrix which has property A (that is, $A$ is two-cyclic), is
consistently ordered, and has nonzero diagonal elements. In addition, assume
that the eigenvalues of the iteration matrix for the Jacobi method are real and
$\rho(J) < 1$. Then for any $\sigma \in (0, 2)$, $\rho(S_\sigma) < 1$, i.e., the SOR method converges.
Moreover, there is a unique value of $\sigma = \sigma_0$, given by

$$\sigma_0 = \frac{2}{1 + \sqrt{1 - (\rho(J))^2}},$$
for which $\rho(S_{\sigma_0}) = \min_{0 < \sigma < 2} \rho(S_\sigma)$ and $\rho(S_{\sigma_0}) = \sigma_0 - 1$, i.e., for which the SOR
method has optimal asymptotic rate of convergence.


For a proof, see [68]. 


REMARK 3.41 It can be shown that if $A$ is positive definite and
tridiagonal then $A$ satisfies the conditions of Theorem 3.20, i.e., $A$ has property
A, $A$ is consistently ordered, $A$ has nonzero diagonal elements, and $J$ has real
eigenvalues with $\rho(J) < 1$.


Example 3.17 


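Let
$$A = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix} \quad \text{and} \quad J = -\begin{pmatrix} 0 & 1/2 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1/2 & 0 \end{pmatrix},$$
so
$$\lambda_1 = 0, \quad \lambda_2 = \frac{1}{\sqrt{2}}, \quad \lambda_3 = -\frac{1}{\sqrt{2}}.$$
Thus,
$$\rho(J) = \frac{1}{\sqrt{2}} \approx 0.7071.$$

By Theorem 3.20, $\sigma_0 = 2/(1 + \sqrt{1 - \rho^2(J)}) = 4 - 2\sqrt{2} \approx 1.1716$, and
$\rho(S_{\sigma_0}) = \sigma_0 - 1 \approx 0.1716$. In addition,
$$S_1 = \mathcal{G} = -(L+D)^{-1}U = \begin{pmatrix} 0 & -1/2 & 0 \\ 0 & 1/4 & -1/2 \\ 0 & -1/8 & 1/4 \end{pmatrix},$$
with
$$\lambda_1 = 0, \quad \lambda_2 = 0, \quad \lambda_3 = \frac{1}{2}.$$
Thus, $\rho(S_1) = 1/2$. Note that
$$0 < \underbrace{\rho(S_{\sigma_0})}_{0.1716} < \underbrace{\rho(S_1)}_{0.5000} < \underbrace{\rho(J)}_{0.7071} < 1.$$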
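The numbers in Example 3.17 can be reproduced numerically. The sketch below assumes the SOR iteration matrix has the usual form $S_\sigma = (D + \sigma L)^{-1}[(1-\sigma)D - \sigma U]$ for the splitting $A = L + D + U$ (the form is an assumption here, since the defining equation lies outside this section); the function names are illustrative.

```python
import numpy as np

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def sor_iteration_matrix(A, sigma):
    """S_sigma = (D + sigma L)^{-1} [(1 - sigma) D - sigma U]; sigma = 1 gives Gauss-Seidel."""
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    return np.linalg.solve(D + sigma * L, (1.0 - sigma) * D - sigma * U)

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
D = np.diag(np.diag(A))
J = -np.linalg.solve(D, A - D)      # Jacobi iteration matrix

rho_J = spectral_radius(J)
sigma_0 = 2.0 / (1.0 + np.sqrt(1.0 - rho_J ** 2))   # optimal parameter of Theorem 3.20
rho_opt = spectral_radius(sor_iteration_matrix(A, sigma_0))
rho_gs = spectral_radius(sor_iteration_matrix(A, 1.0))
print(rho_J, sigma_0, rho_opt, rho_gs)
```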


3.4.8 Convergence of General Matrix Fixed Point Iterations 


There is one more useful concept. Recall that
$$x^{(k+1)} = Bx^{(k)} + c, \quad k = 0, 1, 2, \ldots \qquad (3.36)$$

is the general iterative method considered, and that the error in the $k$-th
iterate is given as

$$e^{(k+1)} = B^{k+1}e^{(0)}, \quad \text{where } e^{(k)} = x^{(k)} - x. \qquad (3.37)$$

Thus, how small $\|B^k\|$ is determines how fast $e^{(k)} \to 0$, since
$$\|e^{(k)}\| \le \|B^k\|\,\|e^{(0)}\|.$$

The following result is relevant to the size of $\|B^k\|$.


THEOREM 3.21 


For any $n \times n$ matrix $B$ and any matrix norm $\|\cdot\|$,

$$\lim_{k \to \infty} \|B^k\|^{1/k} = \rho(B).$$


The proof of Theorem 3.21 follows from arguments in Section 5.2 (starting 
on page 297; see Exercise 5 on page 320). We also give an alternate proof, 
after an additional remark and definition. 
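Theorem 3.21 can be observed numerically. The sketch below evaluates $\|B^k\|_\infty^{1/k}$ for increasing $k$ on a small non-normal matrix chosen purely for illustration (a Jordan-like block with $\rho(B) = 0.5$); for such matrices $\|B\|$ can exceed $\rho(B)$ considerably, yet the $k$-th roots still approach $\rho(B)$.

```python
import numpy as np

def norm_root(B, k):
    """Return ||B^k||_inf ** (1/k)."""
    Bk = np.linalg.matrix_power(B, k)
    return float(np.linalg.norm(Bk, np.inf) ** (1.0 / k))

# Illustrative non-normal matrix: rho(B) = 0.5 but ||B||_inf = 1.5.
B = np.array([[0.5, 1.0],
              [0.0, 0.5]])
rho = float(np.max(np.abs(np.linalg.eigvals(B))))

roots = {k: norm_root(B, k) for k in (1, 10, 100, 1000)}
for k, value in roots.items():
    print(k, value)
```

The slow decay of the gap for this matrix reflects the polynomial factor in $\|B^k\|$ that a nontrivial Jordan block introduces.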


REMARK 3.42 From (3.37), we have $\|e^{(k)}\| \le \|B^k\|\,\|e^{(0)}\|$. But for
sufficiently large $k$, $\|B^k\| \approx \rho^k(B)$, so $\|e^{(k)}\|$ is approximately less than
or equal to $\rho^k(B)\|e^{(0)}\|$. We can therefore compare the number of iterations
expected to reduce the error below a certain tolerance $\tau$ for two different
methods. Suppose that iteration matrices $B_1$ and $B_2$ are different for two methods.
Suppose that $\|e^{(k_1)}\| \le \rho^{k_1}(B_1)\|e^{(0)}\| = \tau$ and $\|e^{(k_2)}\| \le \rho^{k_2}(B_2)\|e^{(0)}\| = \tau$.
Thus, to ensure that the error is below tolerance $\tau$, $\rho^{k_1}(B_1) = \rho^{k_2}(B_2)$, whence

$$\frac{k_1}{k_2} = \frac{\ln \rho(B_2)}{\ln \rho(B_1)}.$$

Hence, the convergence rate is proportional to $-\ln \rho(B)$, since the number of
iterations to convergence is proportional to $-1/\ln \rho(B)$.

Remark 3.42 leads us to


DEFINITION 3.41 Let $B$ be a convergent iteration matrix ($\rho(B) < 1$).
The asymptotic rate of convergence of the iteration is defined by

$$R_\infty(B) = -\ln \rho(B).$$
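As a rough illustration of Remark 3.42 and Definition 3.41, the sketch below estimates how many iterations are needed to reduce the error by a factor `tol`, using the spectral radii found in Example 3.17 for the Jacobi, Gauss-Seidel, and optimal SOR iterations. The estimate $k \approx \ln(\text{tol})/\ln\rho(B)$ is only asymptotic, and the function names are illustrative.

```python
import math

def asymptotic_rate(rho):
    """R_inf(B) = -ln rho(B) for a convergent iteration (0 < rho < 1)."""
    return -math.log(rho)

def estimated_iterations(rho, tol):
    """Smallest k with rho**k <= tol, i.e. k >= ln(tol) / ln(rho)."""
    return math.ceil(math.log(tol) / math.log(rho))

# Spectral radii from Example 3.17.
rho_jacobi = 1.0 / math.sqrt(2.0)
rho_gauss_seidel = 0.5
rho_sor_optimal = 3.0 - 2.0 * math.sqrt(2.0)   # sigma_0 - 1 with sigma_0 = 4 - 2*sqrt(2)

tol = 1e-6
for name, rho in [("Jacobi", rho_jacobi),
                  ("Gauss-Seidel", rho_gauss_seidel),
                  ("optimal SOR", rho_sor_optimal)]:
    print(name, asymptotic_rate(rho), estimated_iterations(rho, tol))
```

Since $\rho(\mathcal{G}) = \rho(J)^2$ in that example, the Gauss-Seidel rate is twice the Jacobi rate, so about half as many iterations are expected.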


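PROOF (of Theorem 3.21) Recall that $\rho(B) \le \|B\|$ for any matrix norm,
whence $[\rho(B)]^k = \rho(B^k) \le \|B^k\|$, that is,

$$\rho(B) \le \|B^k\|^{1/k} \quad \text{for } k = 1, 2, \ldots \qquad (3.59)$$
Now let $\epsilon > 0$ be arbitrary. We define
$$B(\epsilon) = \frac{B}{\rho(B) + \epsilon}, \quad \text{so that} \quad \rho(B(\epsilon)) = \frac{\rho(B)}{\rho(B) + \epsilon} < 1.$$

This in turn gives $\lim_{k \to \infty} (B(\epsilon))^k = 0$, that is, $\lim_{k \to \infty} \|(B(\epsilon))^k\| = 0$. Hence,
there exists a $k_0$, depending on $\epsilon$, such that

$$\|(B(\epsilon))^k\| = \frac{\|B^k\|}{(\rho(B) + \epsilon)^k} < 1 \quad \text{for } k \ge k_0.$$

Thus,
$$\|B^k\|^{1/k} < \rho(B) + \epsilon \quad \text{for } k \ge k_0.$$
Taking the limit of the left member then gives
$$\lim_{k \to \infty} \|B^k\|^{1/k} \le \rho(B) + \epsilon$$

for $\epsilon > 0$. Since $\epsilon$ was arbitrary, $\lim_{k \to \infty} \|B^k\|^{1/k} \le \rho(B)$. Combining this result
with (3.59) gives $\lim_{k \to \infty} \|B^k\|^{1/k} = \rho(B)$.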


3.4.9 The Block SOR Method 


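We now consider briefly the block SOR iterative method before considering
in detail the conjugate gradient method. Consider

$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1\nu} \\ \vdots & & & \vdots \\ A_{\nu 1} & \cdots & \cdots & A_{\nu\nu} \end{pmatrix}, \qquad (3.60)$$

where the $A_{ii}$, $1 \le i \le \nu$, are square $r_i \times r_i$ matrices with $r_i \ge 1$, $1 \le i \le \nu$,
and where the entire matrix $A$ is $n$ by $n$. Corresponding to the partitioning (3.60),
we define a block diagonal matrix $D$, a block lower triangular matrix $L$, and
a block upper triangular matrix $U$ by

$$D = \begin{pmatrix} A_{11} & 0 & \cdots & 0 \\ 0 & A_{22} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & A_{\nu\nu} \end{pmatrix}, \quad L = \begin{pmatrix} 0 & \cdots & \cdots & 0 \\ A_{21} & 0 & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ A_{\nu 1} & \cdots & A_{\nu,\nu-1} & 0 \end{pmatrix}, \quad U = \begin{pmatrix} 0 & A_{12} & \cdots & A_{1\nu} \\ \vdots & \ddots & \ddots & \vdots \\ \vdots & & 0 & A_{\nu-1,\nu} \\ 0 & \cdots & \cdots & 0 \end{pmatrix}. \qquad (3.61)$$

Assume that $D$ is nonsingular. Then $(D + \sigma L)$ is invertible for all $\sigma$. If we
partition $x$ and $b$ consistently with (3.60), then $Ax = b$ can be written as

$$\begin{pmatrix} A_{11} & \cdots & A_{1\nu} \\ A_{21} & \cdots & A_{2\nu} \\ \vdots & & \vdots \\ A_{\nu 1} & \cdots & A_{\nu\nu} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_\nu \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_\nu \end{pmatrix}.$$

That is, $\sum_{j=1}^{\nu} A_{ij}x_j = b_i$, $1 \le i \le \nu$. We now obtain the block SOR method
defined by

$$A_{ii}x_i^{(k+1)} = (1 - \sigma)A_{ii}x_i^{(k)} + \sigma \left\{ -\sum_{j=1}^{i-1} A_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{\nu} A_{ij}x_j^{(k)} + b_i \right\} \qquad (3.62)$$

for $k = 0, 1, 2, \ldots$ and $1 \le i \le \nu$, with $x^{(0)}$ given.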
REMARK 3.43 If $\sigma = 1$, we obtain the block Gauss-Seidel method.


REMARK 3.44 If, for example, $D$ is positive definite and $A$ is positive
definite, then the block SOR method converges for $0 < \sigma < 2$.
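The block SOR update (3.62) can be sketched as follows: for each diagonal block, solve $A_{ii}y = b_i - \sum_{j \ne i} A_{ij}x_j$ using the most recent values of the other blocks, then relax with parameter $\sigma$. The function name, its parameters, and the test system (a small symmetric positive definite matrix split into two $2 \times 2$ blocks) are illustrative assumptions; dense NumPy arrays are used for brevity.

```python
import numpy as np

def block_sor(A, b, block_sizes, sigma=1.0, iterations=100, x0=None):
    """Block SOR for Ax = b with diagonal blocks of the given sizes."""
    n = A.shape[0]
    offsets = np.concatenate(([0], np.cumsum(block_sizes)))
    assert offsets[-1] == n, "block sizes must sum to n"
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    for _ in range(iterations):
        for i in range(len(block_sizes)):
            lo, hi = offsets[i], offsets[i + 1]
            # Right side of (3.62) without the relaxation: blocks j < i are
            # already updated in x, blocks j > i still hold the old iterate.
            rhs = b[lo:hi] - A[lo:hi, :lo] @ x[:lo] - A[lo:hi, hi:] @ x[hi:]
            y = np.linalg.solve(A[lo:hi, lo:hi], rhs)
            x[lo:hi] = (1.0 - sigma) * x[lo:hi] + sigma * y
    return x

A = np.array([[4.0, -1.0, -1.0, 0.0],
              [-1.0, 4.0, 0.0, -1.0],
              [-1.0, 0.0, 4.0, -1.0],
              [0.0, -1.0, -1.0, 4.0]])
b = np.ones(4)
x_sor = block_sor(A, b, [2, 2], sigma=1.1, iterations=100)
x_gs = block_sor(A, b, [2, 2], sigma=1.0, iterations=100)
print(x_sor)
```

In practice each diagonal block would be factored once outside the loop rather than re-solved on every sweep.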


Block methods allow us to efficiently handle systems in which the pattern
of nonzeros is in a block structure, such as in discretizations of various types
of partial differential equations. Block methods can lead to efficient use of
resources in parallel computing, etc.

3.4.10 The Conjugate Gradient Method


We now consider the conjugate gradient method. It is assumed here that $A$
is symmetric positive definite. This method is well-suited for sparse matrices
(many zero elements). Consider the linear system $Ax = b$ and consider the
corresponding quadratic functional

$$F(v) = \frac{1}{2}\sum_{i,j=1}^{n} a_{ij}v_iv_j - \sum_{i=1}^{n} b_iv_i = \frac{1}{2}(Av, v) - (b, v), \qquad (3.63)$$
where $(x, y) = x^Ty$. Since $A$ is real symmetric positive definite,
$$(Av, v) > 0 \text{ if } v \ne 0. \qquad (3.64)$$

Now consider the derivative of $F(v)$ with respect to $v_i$. Then

$$\frac{\partial F}{\partial v_i} = \sum_{j=1}^{n} a_{ij}v_j - b_i. \qquad (3.65)$$

First, recall that the residual vector $r$ for $v$ is defined by
$$r = Av - b, \qquad (3.66)$$

whence

$$\nabla F = r = Av - b. \qquad (3.67)$$
This leads to 


THEOREM 3.22 
The solution to $Ax = b$ (assuming that $A$ is symmetric positive definite) is
equivalent to the problem of finding the minimum of the quadratic functional
$F(v)$ of (3.63).


PROOF The solution of $Ax = b$ has zero residual vector $r = Ax - b = 0$.
But if $v$ minimizes (3.63), a necessary condition is $\nabla F(v) = 0$. Thus, one such
$v$ is clearly $v = x$, since $\nabla F(x) = r = 0$. But since $A$ is positive definite, $F(v)$
has one and only one such critical point. In addition, $x$ gives a minimum
of $F(v)$. (To see this, consider $F(x + \Delta x) = F(x) + \frac{1}{2}(A\Delta x, \Delta x)$. Thus,
$F(x) < F(x + \Delta x)$ for any $\Delta x \ne 0$.) Conversely, if $x$ is a minimum of $F$, then
by (3.67), $Ax - b = 0$, or $x$ is the solution of $Ax = b$.


The conjugate gradient method finds the minimum $x$ of $F$ in at most $n$
steps. In the conjugate gradient method, $v$ is a given vector, and we choose a
vector $p \ne 0$. We then let $v' = v + tp$, where $t$ is specified later. For fixed $v$
and specified $p$, $F(v')$ is a quadratic function of $t$ only. It is given by

$$F(v') = F(v + tp) = \frac{1}{2}(Av, v) + t(Av, p) + \frac{1}{2}t^2(Ap, p) - (b, v) - t(b, p),$$
or, using (3.63) and (3.66),
$$F(v') = \frac{1}{2}t^2(Ap, p) + t(r, p) + F(v). \qquad (3.68)$$

We now choose $t$ so that if we start at $v$ and go in direction $p$, then $v'$ will
minimize $F$. A necessary condition is $\dfrac{dF}{dt} = t(Ap, p) + (r, p) = 0$ or

$$\hat{t} = -\frac{(r, p)}{(Ap, p)}. \qquad (3.69)$$

Since $\dfrac{d^2F}{dt^2} = (Ap, p) > 0$, choosing $t = \hat{t}$ produces a minimum of $F$ along the
relaxation direction $p$. The point $\hat{v} = v + \hat{t}p$ is called a minimum point.
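We now describe the conjugate gradient method procedure.


Step 1. Choose initial vector $v^{(0)}$ and direction

$$p^{(1)} = -r^{(0)} = -\mathrm{grad}(F(v^{(0)})) = -Av^{(0)} + b.$$

Then
$$v^{(1)} = v^{(0)} - t_1r^{(0)},$$

where
$$t_1 = \frac{(r^{(0)}, r^{(0)})}{(Ar^{(0)}, r^{(0)})} = -\frac{(r^{(0)}, p^{(1)})}{(Ap^{(1)}, p^{(1)})}.$$

Step 2. We are given $v^{(1)}$. We now choose $p^{(2)}$ as a linear combination of
$r^{(1)}$ and $p^{(1)}$. ($-r^{(1)} = -Av^{(1)} + b = -\nabla F(v^{(1)})$.) We let

$$p^{(2)} = -r^{(1)} + \mu_1p^{(1)},$$

where $\mu_1$ is chosen so that $(Ap^{(2)}, p^{(1)}) = (p^{(2)}, Ap^{(1)}) = 0$. Thus,

$$\mu_1 = \frac{(r^{(1)}, Ap^{(1)})}{(p^{(1)}, Ap^{(1)})}.$$

We now proceed along $p^{(2)}$ in the usual manner to a minimum point.
After the $k$-th step, our directions and residual vectors satisfy the following
for $k = 2, 3, 4, \ldots$:

$$v^{(k)} = v^{(k-1)} + t_kp^{(k)}, \quad \text{where } t_k = -\frac{(r^{(k-1)}, p^{(k)})}{(Ap^{(k)}, p^{(k)})}, \qquad (3.70)$$
$$r^{(k)} = Av^{(k)} - b = r^{(k-1)} + t_kAp^{(k)}, \qquad (3.71)$$
$$p^{(k)} = -r^{(k-1)} + \mu_{k-1}p^{(k-1)}, \quad \text{where } \mu_{k-1} = \frac{(r^{(k-1)}, Ap^{(k-1)})}{(p^{(k-1)}, Ap^{(k-1)})}. \qquad (3.72)$$

We have the following result:


PROPOSITION 3.11
$(r^{(k)}, r^{(k-1)}) = (r^{(k)}, p^{(k-1)}) = (r^{(k)}, p^{(k)}) = 0$ for $k = 2, 3, \ldots$. In addition,
$(r^{(1)}, r^{(0)}) = 0$ and $(r^{(1)}, p^{(1)}) = 0$.


PROOF
$$\begin{aligned} -(r^{(1)}, p^{(1)}) = (r^{(1)}, r^{(0)}) &= (Av^{(1)} - b, r^{(0)}) \\ &= (Av^{(0)} - t_1Ar^{(0)} - b, r^{(0)}) \\ &= (r^{(0)} + b - t_1Ar^{(0)} - b, r^{(0)}) \\ &= (r^{(0)}, r^{(0)}) - t_1(Ar^{(0)}, r^{(0)}) = 0. \end{aligned}$$
The proof now proceeds by induction. Consider the $k$-th step. By (3.71),
$$(r^{(k)}, p^{(k)}) = (r^{(k-1)}, p^{(k)}) + t_k(Ap^{(k)}, p^{(k)}) = 0,$$
since $t_k = -(r^{(k-1)}, p^{(k)})/(Ap^{(k)}, p^{(k)})$. By (3.71) and (3.72),
$$\begin{aligned} (r^{(k)}, p^{(k-1)}) &= (r^{(k-1)}, p^{(k-1)}) + t_k(Ap^{(k)}, p^{(k-1)}) \\ &= t_k[(-Ar^{(k-1)} + \mu_{k-1}Ap^{(k-1)}, p^{(k-1)})] \\ &= t_k(-Ar^{(k-1)}, p^{(k-1)}) + t_k\mu_{k-1}(Ap^{(k-1)}, p^{(k-1)}) \\ &= 0 \text{ by definition of } \mu_{k-1}. \end{aligned}$$
Finally, by what we just showed and (3.72),
$$0 = (r^{(k)}, p^{(k)}) = -(r^{(k)}, r^{(k-1)}) + \mu_{k-1}(r^{(k)}, p^{(k-1)}).$$
Thus, $(r^{(k)}, r^{(k-1)}) = 0$.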


REMARK 3.45 $\mu_{k-1}$ and $t_k$ are well-defined as long as $(Ap^{(k)}, p^{(k)}) \ne 0$.
But since $A$ is positive definite, $(Ap^{(k)}, p^{(k)}) > 0$ for $p^{(k)} \ne 0$. If $p^{(k)} = 0$,
then by (3.72), $r^{(k-1)} = \mu_{k-1}p^{(k-1)}$ and hence, since $(r^{(k-1)}, p^{(k-1)}) = 0$,
$(r^{(k-1)}, r^{(k-1)}) = 0$, and thus $r^{(k-1)} = 0$. This indicates that $0 = r^{(k-1)} =
Av^{(k-1)} - b$ or that $v^{(k-1)}$ is the solution.


We now have: 


THEOREM 3.23 
In the conjugate gradient method, the residual vectors $r^{(k)}$, $k = 0, 1, 2, \ldots$
form an orthogonal system, i.e., $(r^{(i)}, r^{(j)}) = 0$ for $i \ne j$.


PROOF (By Induction) Assume that after the $k$-th step we have
$$(r^{(i)}, r^{(j)}) = 0 \quad \text{for } i \ne j, \ 0 \le i, j \le k, \qquad (3.73)$$
$$(p^{(i)}, Ap^{(j)}) = 0 \quad \text{for } i \ne j, \ 0 \le i, j \le k. \qquad (3.74)$$

As the base case, (3.73) holds for $k = 1$ by Proposition 3.11, while (3.74) holds also
for $k = 1$, because $p^{(0)}$ can be set equal to 0. We need to prove the following:

$$(p^{(k+1)}, Ap^{(j)}) = 0 \quad \text{for } j = 1, 2, \ldots, k \text{ and} \qquad (3.75)$$

$$(r^{(k+1)}, r^{(j)}) = 0 \quad \text{for } j = 0, 1, 2, \ldots, k. \qquad (3.76)$$

For $j = k$, (3.75) is satisfied by construction of direction $p^{(k+1)}$ as seen by
(3.72). By (3.72) and (3.74), we have for $1 \le j < k$ that

$$(p^{(k+1)}, Ap^{(j)}) = -(r^{(k)}, Ap^{(j)}) + \mu_k(p^{(k)}, Ap^{(j)}) = -(r^{(k)}, Ap^{(j)}).$$
But by (3.71),
$$Ap^{(j)} = \frac{r^{(j)} - r^{(j-1)}}{t_j}, \quad \text{so} \quad (p^{(k+1)}, Ap^{(j)}) = -\frac{1}{t_j}\left[(r^{(k)}, r^{(j)}) - (r^{(k)}, r^{(j-1)})\right] = 0$$

by (3.73). Thus, (3.75) holds for $1 \le j \le k$.
Now consider (3.76). (3.76) holds for $j = k$ by the previous proposition.
For $0 \le j < k$, we note that by (3.71) and (3.73),

$$(r^{(k+1)}, r^{(j)}) = (r^{(k)}, r^{(j)}) + t_{k+1}(Ap^{(k+1)}, r^{(j)}) = t_{k+1}(Ap^{(k+1)}, r^{(j)}). \qquad (3.77)$$

Using (3.72) with $j$ in place of $k - 1$, we have for $1 \le j < k$,
$$(r^{(k+1)}, r^{(j)}) = t_{k+1}\left[-(Ap^{(k+1)}, p^{(j+1)}) + \mu_j(Ap^{(k+1)}, p^{(j)})\right].$$

By (3.75), which we just proved, we have (3.76) for $1 \le j < k$. Finally, for
$j = 0$, $r^{(0)} = -p^{(1)}$ and by (3.71), (3.73), and (3.75),

$$(r^{(k+1)}, r^{(0)}) = t_{k+1}(Ap^{(k+1)}, r^{(0)}) = -t_{k+1}(Ap^{(k+1)}, p^{(1)}) = 0.$$

We now have the following remarkable result that the conjugate gradient 
method is actually a finite procedure. 


COROLLARY 3.1 
The conjugate gradient method yields the solution of Ax = b in at most n 
steps. 


PROOF By Theorem 3.23, the residual vectors $r^{(k)}$, $k \ge 0$, form an
orthogonal system. But these vectors belong to $\mathbb{R}^n$ and hence contain at
most $n$ nonzero vectors. Thus, by the $n$-th step, $r^{(n)} = 0$ and thus $v^{(n)}$
satisfies $Av^{(n)} - b = r^{(n)} = 0$, i.e., $v^{(n)}$ is the solution of $Ax = b$.


REMARK 3.46 Although the solution should be obtained in at most $n$
steps, due to rounding error, the residual vector $r^{(n)}$ may be nonzero.
Calculations may therefore be continued beyond the $n$-th step. Also, especially for
$n$ large, the iterations may yield an approximate solution of sufficient accuracy
before the $n$-th step.


REMARK 3.47 This method is very attractive for sparse positive definite
matrices $A$. The bulk of the work in each step is expended in calculating $Ap^{(k)}$,
a matrix-vector product which requires $O(n^2)$ computations, but this work is
much reduced if $A$ is sparse. For example, if each row of $A$ contains $\alpha$ nonzero
entries, then $Ap^{(k)}$ requires $O(\alpha n)$ work.
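The recurrences (3.70) to (3.72) translate almost line by line into code. The sketch below keeps the sign convention of the text, with residual $r = Av - b$ and first direction $p^{(1)} = -r^{(0)}$, and records the residuals so that the orthogonality of Theorem 3.23 can be inspected. The function name, the stopping tolerance, and the test system are illustrative assumptions; a production code would use a sparse matrix-vector product for `A @ p`.

```python
import numpy as np

def conjugate_gradient(A, b, v0=None, tol=1e-12):
    """Conjugate gradient for symmetric positive definite A, at most n steps."""
    n = len(b)
    v = np.zeros(n) if v0 is None else np.array(v0, dtype=float)
    r = A @ v - b              # residual as in (3.66)
    p = -r                     # first direction p^(1) = -r^(0)
    residuals = [r.copy()]
    for _ in range(n):
        if np.linalg.norm(r) <= tol:
            break
        Ap = A @ p             # the dominant cost of each step
        t = -(r @ p) / (Ap @ p)        # step length, (3.70)
        v = v + t * p
        r = r + t * Ap                 # residual update, (3.71)
        residuals.append(r.copy())
        mu = (r @ Ap) / (p @ Ap)       # makes the next direction A-conjugate
        p = -r + mu * p                # new direction, (3.72)
    return v, residuals

A = np.array([[4.0, -1.0, -1.0, 0.0],
              [-1.0, 4.0, 0.0, -1.0],
              [-1.0, 0.0, 4.0, -1.0],
              [0.0, -1.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
v, residuals = conjugate_gradient(A, b)
print(v, len(residuals) - 1)
```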


3.4.11 Iterative Methods for Matrix Inversion 


We now briefly consider iterative methods for matrix inversion. However,
in practice, it is usually not necessary to find the inverse of a matrix. Except
when the matrix inverse is explicitly required,^21 it is faster to use a direct
method such as Gaussian elimination or an iterative method such as the SOR
method to solve a single system of equations. We nonetheless consider two
iterative methods for matrix inversion.


THEOREM 3.24 

(Iterative method 1) Let $A$ be an $n \times n$ nonsingular matrix. Suppose that $C$
is an $n \times n$ matrix such that $\|I - AC\| \le q < 1$. Then for any nonsingular
matrix $X_0$, the sequence

$$X_{k+1} = X_kB + C, \quad \text{where } B = I - AC, \qquad (3.78)$$
converges to $A^{-1}$, with error estimates

$$\|X_k - A^{-1}\| \le \frac{q}{1-q}\|X_k - X_{k-1}\| \le \frac{q^k}{1-q}\|X_1 - X_0\|, \quad k = 1, 2, \ldots. \qquad (3.79)$$

PROOF By hypothesis, $\|B\| \le q < 1$. Thus,

$$(I - B)^{-1} = I + B + \cdots + B^k + \cdots = \sum_{j=0}^{\infty} B^j.$$

Note that $(AC)^{-1} = \sum_{j=0}^{\infty} B^j$. Since $A$ and $AC$ are nonsingular, $C$ is
nonsingular.
Now consider $X_k = X_{k-1}B + C$. Thus,

$$X_1 = X_0B + C, \quad X_2 = X_1B + C = X_0B^2 + CB + C,$$
and, in general,
$$X_k = C\sum_{j=0}^{k-1} B^j + X_0B^k \quad \text{for } k = 1, 2, \ldots.$$
Hence,
$$X = \lim_{k \to \infty} X_k = C\sum_{j=0}^{\infty} B^j = C(AC)^{-1} = A^{-1}.$$
In addition, since $A^{-1} = A^{-1}B + C$, we have $X_{k+1} - A^{-1} = (X_k - A^{-1})B$.
Thus, $(X_{k+1} - A^{-1}) - (X_k - A^{-1}) = (X_k - A^{-1})(B - I)$ and hence $(X_k - A^{-1}) =
(X_k - X_{k+1})(I - B)^{-1}$. Therefore,

$$\|X_k - A^{-1}\| \le \frac{\|X_k - X_{k+1}\|}{1 - q}. \qquad (3.80)$$

(Recall that $\|(I - B)^{-1}\| \le \dfrac{1}{1 - \|B\|} \le \dfrac{1}{1 - q}$.) Also, $(X_{k+1} - X_k) = (X_k -
X_{k-1})B$ (since $X_{k+1} = X_kB + C$ and $X_k = X_{k-1}B + C$). Hence,
$$\|X_{k+1} - X_k\| \le q\|X_k - X_{k-1}\|. \qquad (3.81)$$

Inequalities (3.80) and (3.81) finally give

$$\|X_k - A^{-1}\| \le \frac{q}{1-q}\|X_k - X_{k-1}\| \le \frac{q^k}{1-q}\|X_1 - X_0\|.$$

^21 such as for sensitivity analysis or in preconditioning an interval method
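A minimal sketch of iterative method 1 follows. The choice $C = A^T/(\|A\|_1\|A\|_\infty)$ is an assumption made here only to obtain some matrix with $\|I - AC\|_2 < 1$; it is not prescribed by the text, and the function name and test matrix are likewise illustrative.

```python
import numpy as np

def iterative_inverse_linear(A, C, X0, iterations):
    """Iterate X_{k+1} = X_k B + C with B = I - A C, as in (3.78)."""
    B = np.eye(A.shape[0]) - A @ C
    X = X0.copy()
    for _ in range(iterations):
        X = X @ B + C
    return X

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
# Assumed crude approximate inverse; I - A C is then symmetric with
# eigenvalues in [0, 1), so its 2-norm is less than 1.
C = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
q = np.linalg.norm(np.eye(2) - A @ C, 2)

X = iterative_inverse_linear(A, C, np.eye(2), 400)
print(q)
print(X)
```

Convergence is only linear, with the error shrinking by roughly a factor $q$ per step, which is why several hundred iterations are used here.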


THEOREM 3.25 
(Iterative method 2; actually a Newton method) Let $A$ be nonsingular.
Suppose that $\|R_0\| = \|I - AX_0\| \le q < 1$. Then the iterates defined by

$$X_{k+1} = X_k(I + R_k), \quad R_k = I - AX_k, \quad k = 0, 1, 2, \ldots \qquad (3.82)$$
converge to $A^{-1}$ with error estimates

$$\|X_k - A^{-1}\| \le \frac{\|X_0\|}{1-q}\|I - AX_k\| \le \frac{\|X_0\|}{1-q}q^{2^k}. \qquad (3.83)$$


PROOF With the above notation

$$R_k = I - AX_k = I - AX_{k-1}(I + R_{k-1}) = R_{k-1} - AX_{k-1}R_{k-1} = (I - AX_{k-1})R_{k-1} = R_{k-1}^2. \qquad (3.84)$$
Thus,
$$R_k = (R_0)^{2^k} \quad \text{for } k = 0, 1, 2, \ldots \qquad (3.85)$$

In addition, as $\|R_0\| < 1$, $\|R_k\| \to 0$ as $k \to \infty$ and $X_k = A^{-1}(I - R_k) \to A^{-1}$
as $k \to \infty$. Also, $R_0 = I - AX_0$ and thus $A^{-1} = X_0(I - R_0)^{-1}$ gives

$$\|A^{-1}\| \le \frac{\|X_0\|}{1 - q}. \qquad (3.86)$$

Therefore,

$$\|X_k - A^{-1}\| \le \|A^{-1}\|\,\|I - AX_k\| = \|A^{-1}\|\,\|R_k\| \le \frac{\|X_0\|}{1-q}\|R_k\| \le \frac{\|X_0\|}{1-q}q^{2^k}. \qquad (3.87)$$

3.4.12 Krylov Subspace Methods 


We now briefly consider Krylov subspace methods for approximate solution
of $Ax = b$, although these methods are also useful for finding eigenvalues
(Chapter 5). It is assumed in the following that $A$ is $n \times n$ and nonsingular.
References for Krylov subspace methods are [23] or [78]. We begin with


DEFINITION 3.42 A Krylov subspace is a subspace of the form
$$\mathcal{K}_m(A, v) = \mathrm{span}(v, Av, A^2v, \ldots, A^{m-1}v).$$


To motivate Krylov subspace methods, consider the case $y_1 = b$, $y_2 =
Ay_1 = Ab$, ..., $y_n = Ay_{n-1} = A^{n-1}y_1 = A^{n-1}b$. Let $K$ be the $n \times n$ matrix

$$K = (y_1, y_2, \ldots, y_n).$$

Then $AK = (Ay_1, Ay_2, \ldots, Ay_n) = (y_2, y_3, \ldots, y_n, A^ny_1)$. Assume that $K$ is
nonsingular and let $c = -K^{-1}A^ny_1$. Then $AK = K[e_2, e_3, \ldots, e_n, -c]$. Thus,

$$K^{-1}AK = C = \begin{pmatrix} 0 & 0 & \cdots & 0 & -c_1 \\ 1 & 0 & \cdots & 0 & -c_2 \\ 0 & 1 & \cdots & 0 & -c_3 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & -c_n \end{pmatrix}. \qquad (3.88)$$

(This matrix is called a companion matrix.) Note that $C$ is upper
Hessenberg^22 and $C$ has the same eigenvalues as $A$ because, if $Az = \lambda z$, then
$Cy = \lambda y$ where $y = K^{-1}z$. Also, the characteristic polynomial of $C$ is very
simple:

$$p(x) = \det(xI - C) = x^n + \sum_{i=1}^{n} c_ix^{i-1}.$$

Thus, the eigenvalues of $A$ can be found by finding the zeros of $p(x)$.

Unfortunately, this technique is not useful in practice, since finding $c$
requires computing $A^ny_1$ and then solving $Kc = -A^ny_1$. The matrix $K$ is likely
to be ill-conditioned, so $c$ would be inaccurately computed.^23
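The companion-matrix construction and its numerical weakness can both be seen on a toy problem. The sketch below forms the Krylov matrix $K$ for a small diagonal matrix chosen for illustration, verifies that $K^{-1}AK$ has the structure (3.88), and then reports the condition number of $K$ for a modestly larger matrix. The helper name and both test matrices are assumptions.

```python
import numpy as np

def krylov_matrix(A, b):
    """K = (b, Ab, A^2 b, ..., A^{n-1} b)."""
    n = A.shape[0]
    columns = [b]
    for _ in range(n - 1):
        columns.append(A @ columns[-1])
    return np.column_stack(columns)

# Toy example with eigenvalues 1, 2, 3.
A = np.diag([1.0, 2.0, 3.0])
b = np.ones(3)
K = krylov_matrix(A, b)
C = np.linalg.solve(K, A @ K)            # companion matrix K^{-1} A K
c = -np.linalg.solve(K, A @ K[:, -1])    # c = -K^{-1} A^n y_1

# Conditioning deteriorates quickly as n grows.
A_big = np.diag(np.arange(1.0, 13.0))
cond_big = np.linalg.cond(krylov_matrix(A_big, np.ones(12)))
print(C)
print(c, cond_big)
```

For the toy matrix, $c = (-6, 11, -6)$ reproduces the coefficients of $(x-1)(x-2)(x-3) = x^3 - 6x^2 + 11x - 6$.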

To overcome these problems, we will replace $K$ by an orthogonal matrix $Q$,
such that for all $m$, the leading $m$ columns of $K$ and $Q$ span the same space.
In contrast to $K$, $Q$ is well-conditioned and easy to invert. Furthermore, we
may compute only as many columns of $Q$ as necessary to get an accurate
solution. Generally, fewer columns are required than the matrix dimension $n$.

Thus, let

$$K = QR \quad \text{(the QR decomposition of } K\text{)}.$$
Then $K^{-1}AK = R^{-1}Q^TAQR = C$, implying that
$$Q^TAQ = RCR^{-1} \equiv H.$$

Since $R$ is upper triangular and $C$ is upper Hessenberg, it follows that $H$
is upper Hessenberg. Hence, $Q^TAQ = H$ where $Q$ is unitary and $H$ is upper
Hessenberg.


REMARK 3.48 If $A$ is real symmetric, then $H$ is also symmetric, so $H$
must be tridiagonal. In this case, we write $Q^TAQ = T$, where $T$ is tridiagonal.


Now consider computation of $Q$. Let $Q = (q_1, q_2, \ldots, q_n)$. Since $Q^TAQ =
H$, $AQ = QH$, and we equate column $j$ on both sides to obtain
$Aq_j = \sum_{i=1}^{j+1} h_{ij}q_i$. Hence,

$$q_m^TAq_j = \sum_{i=1}^{j+1} h_{ij}q_m^Tq_i = h_{mj} \quad \text{for } 1 \le m \le j \qquad (3.89)$$

and
$$h_{j+1,j}q_{j+1} = Aq_j - \sum_{i=1}^{j} h_{ij}q_i. \qquad (3.90)$$

By (3.89) and (3.90), we obtain the following algorithm for finding $Q$ and $H$.

^22 An upper Hessenberg matrix is a matrix $A = [a_{ij}]$ that has nonzeros only on and above
the diagonal and in entries immediately below the diagonal; that is, for an upper Hessenberg
matrix, $a_{ij} \ne 0$ only if $j \ge i - 1$.

^23 The columns of $K$ can be related to successive iterates in the power method (see
Section 5.2 on page 297), so $y_j$ and $y_{j+1}$ are likely to be close for $j$ large.


ALGORITHM 3.7

(Arnoldi's Algorithm)
INPUT: $A$, $b$, the number of desired columns $k$, and a zero tolerance $\epsilon$.
OUTPUT: the first $k$ columns of $Q$ and $H$.

1. $q_1 \leftarrow b/\|b\|_2$.
2. FOR $j = 1$ to $k$
   (a) $z \leftarrow Aq_j$.
   (b) FOR $i = 1$ to $j$
       i. $h_{i,j} \leftarrow q_i^T z$.
       ii. $z \leftarrow z - h_{i,j}\, q_i$.
       END FOR
   (c) $h_{j+1,j} \leftarrow \|z\|_2$.
   (d) IF $|h_{j+1,j}| < \epsilon$ THEN RETURN.
   (e) $q_{j+1} \leftarrow z/h_{j+1,j}$.
   END FOR
END ALGORITHM 3.7.
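
As an illustration, Algorithm 3.7 can be sketched in plain Python roughly as follows. This is a minimal sketch rather than a tuned implementation: the helper names `dot`, `matvec`, and `arnoldi`, the list-of-rows matrix storage, the 0-based indices, and the small test matrix are all assumptions made here and do not come from the text.

```python
import math

def dot(x, y):
    """Euclidean inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    """Product of a matrix (list of rows) with a vector."""
    return [dot(row, x) for row in A]

def arnoldi(A, b, k, eps=1e-12):
    """Sketch of Algorithm 3.7 with 0-based indices.

    Returns Q, a list of the orthonormal vectors q_1, q_2, ..., and H,
    a (k+1)-by-k array holding the computed Hessenberg entries.
    """
    norm_b = math.sqrt(dot(b, b))
    Q = [[bi / norm_b for bi in b]]                # step 1
    H = [[0.0] * k for _ in range(k + 1)]
    for j in range(k):                             # step 2
        z = matvec(A, Q[j])                        # (a)
        for i in range(j + 1):                     # (b)
            H[i][j] = dot(Q[i], z)
            z = [zi - H[i][j] * qi for zi, qi in zip(z, Q[i])]
        H[j + 1][j] = math.sqrt(dot(z, z))         # (c)
        if abs(H[j + 1][j]) < eps:                 # (d)
            break
        Q.append([zi / H[j + 1][j] for zi in z])   # (e)
    return Q, H

# Illustrative data, chosen for this sketch only.
A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
Q, H = arnoldi(A, [1.0, 0.0, 0.0], 2)
```

With these definitions, the columns stored in `Q` should be orthonormal, and each product $Aq_j$ should be recoverable from column $j$ of `H` as in the relation preceding (3.89).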


REMARK 3.49 Arnoldi's algorithm is also called a modified Gram-Schmidt
algorithm. Components in directions $q_1$ to $q_j$ are subtracted from $z$,
leaving $z$ orthogonal to them.


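REMARK 3.50 Suppose we use Arnoldi's algorithm to compute $k$ columns
of $Q$. Let $Q = (Q_k, Q_p)$, where $Q_k = (q_1, q_2, \ldots, q_k)$ and $Q_p = (q_{k+1}, \ldots, q_n)$,
and where $Q_p$ is unknown. Then

$$H = Q^TAQ = (Q_k, Q_p)^T A (Q_k, Q_p)
= \begin{pmatrix} Q_k^TAQ_k & Q_k^TAQ_p \\ Q_p^TAQ_k & Q_p^TAQ_p \end{pmatrix}
= \begin{pmatrix} H_k & H_{pk} \\ H_{kp} & H_p \end{pmatrix}, \tag{3.91}$$

where the first block row and block column have size $k$ and the second have size $n-k$.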




Note that $H_k$ is upper Hessenberg and $H_{kp}$ has a single (possibly) nonzero
entry in its upper right corner, namely, $h_{k+1,k}$. $H_k$ and $H_{kp}$ are known, but
$H_{pk}$ and $H_p$ are unknown.

Also, note that $A(Q_k, Q_p) = (Q_k, Q_p)H$, so

$$AQ_k = Q_kH_k + Q_pH_{kp} = Q_kH_k + h_{k+1,k}\, q_{k+1} e_k^T. \tag{3.92}$$


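REMARK 3.51 Suppose that $A$ is symmetric. Then $H = T$ is symmetric
and tridiagonal. Let

$$T = \begin{pmatrix}
\alpha_1 & \beta_1 & 0 & \cdots & 0 \\
\beta_1 & \alpha_2 & \beta_2 & \cdots & 0 \\
0 & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & \ddots & \beta_{n-1} \\
0 & \cdots & 0 & \beta_{n-1} & \alpha_n
\end{pmatrix}. \tag{3.93}$$

Equating column $j$ on both sides of $AQ = QT$ yields

$$Aq_j = \beta_{j-1} q_{j-1} + \alpha_j q_j + \beta_j q_{j+1}.$$

In addition, $q_j^T A q_j = \alpha_j$. This leads to the Lanczos algorithm for $A$ real
symmetric.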


ALGORITHM 3.8

(Lanczos Algorithm)
INPUT: $A$, $b$, the number of desired columns $k$, and a zero tolerance $\epsilon$.
OUTPUT: the first $k$ columns of $Q$ and $T$.

1. $q_1 \leftarrow b/\|b\|_2$, $\beta_0 \leftarrow 0$, $q_0 \leftarrow 0$.
2. FOR $j = 1$ to $k$
   (a) $z \leftarrow Aq_j$.
   (b) $\alpha_j \leftarrow q_j^T z$.
   (c) $z \leftarrow z - \alpha_j q_j - \beta_{j-1} q_{j-1}$.
   (d) $\beta_j \leftarrow \|z\|_2$.
   (e) IF $|\beta_j| < \epsilon$ THEN RETURN.
   (f) $q_{j+1} \leftarrow z/\beta_j$.
   END FOR
END ALGORITHM 3.8.
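
The three-term recurrence of Algorithm 3.8 can be sketched in plain Python in the same spirit. Again, this is only an illustrative sketch: the function name `lanczos`, the returned lists `alpha` and `beta`, the 0-based indexing, and the symmetric test matrix are assumptions introduced here.

```python
import math

def lanczos(A, b, k, eps=1e-12):
    """Sketch of Algorithm 3.8 for a real symmetric A (list of rows).

    Returns Q (list of vectors q_1, q_2, ...), alpha (diagonal of T),
    and beta (off-diagonal of T), as far as they were computed.
    """
    n = len(b)
    norm_b = math.sqrt(sum(x * x for x in b))
    Q = [[x / norm_b for x in b]]                 # q_1
    alpha, beta = [], []
    q_prev, beta_prev = [0.0] * n, 0.0            # q_0 = 0, beta_0 = 0
    for j in range(k):
        q = Q[j]
        z = [sum(a * x for a, x in zip(row, q)) for row in A]      # (a)
        a_j = sum(x * y for x, y in zip(q, z))                     # (b)
        z = [zi - a_j * qi - beta_prev * pi
             for zi, qi, pi in zip(z, q, q_prev)]                  # (c)
        b_j = math.sqrt(sum(x * x for x in z))                     # (d)
        alpha.append(a_j)
        beta.append(b_j)
        if abs(b_j) < eps:                                         # (e)
            break
        Q.append([zi / b_j for zi in z])                           # (f)
        q_prev, beta_prev = q, b_j
    return Q, alpha, beta

# Illustrative symmetric tridiagonal matrix (an assumption of this sketch).
A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
Q, alpha, beta = lanczos(A, [1.0, 0.0, 0.0], 3)
```

Because the test matrix is already tridiagonal and the starting vector is the first unit vector, the computed `alpha` and `beta` should simply reproduce its diagonal and off-diagonal entries, with the last `beta` equal to zero.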


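After $k$ steps of the Lanczos algorithm, we have

$$T = \begin{pmatrix} T_k & T_{pk} \\ T_{kp} & T_p \end{pmatrix}
= \begin{pmatrix} Q_k^TAQ_k & Q_k^TAQ_p \\ Q_p^TAQ_k & Q_p^TAQ_p \end{pmatrix}, \tag{3.94}$$

where the first block row and block column have size $k$ and the second have size $n-k$.

Because $A$ is symmetric, we know $T_k$ and $T_{kp}$ and $T_{pk} = T_{kp}^T$, but we don't
know $T_p$. $T_{kp}$ has a single (possibly) nonzero element in the upper right corner,
$\beta_k$.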

Now consider approximate solution of $Ax = b$ using Krylov subspace techniques.
Consider $K_m(A, r_0) = \mathrm{span}(r_0, Ar_0, \ldots, A^{m-1}r_0)$, where $r_0 = b - Ax_0$.
$Ax = b$ is solved approximately by seeking a solution $x_m = x_0 + w_m$, where
$w_m \in K_m(A, r_0)$ and $b - Ax_m \perp K_m(A, r_0)$.

Let $q_1 = r_0/\|r_0\|_2$ be the starting vector in Arnoldi's algorithm and $\beta = \|r_0\|_2$.
Then

$$Q_m^T A Q_m = H_m \quad\text{and}\quad Q_m^T r_0 = Q_m^T(\beta q_1) = \beta e_1.$$

Then

$$x_m = x_0 + Q_m y_m, \qquad y_m = H_m^{-1}(\beta e_1) \quad\text{or}\quad H_m y_m = \beta e_1. \tag{3.95}$$

This method is called the Full Orthogonalization Method (FOM). To verify
that $b - Ax_m \perp K_m(A, r_0)$, consider the following. We have

$$Ax_m - b = Ax_0 - b + AQ_m y_m = -r_0 + AQ_m H_m^{-1}(\beta e_1).$$

Thus,

$$Q_m^T[Ax_m - b] = Q_m^T[-r_0 + AQ_m H_m^{-1}(\beta e_1)] = -Q_m^T r_0 + \beta e_1 = 0.$$

Hence, $Ax_m - b \perp K_m$.


We also have the following error result:


PROPOSITION 3.12
The residual vector $b - Ax_m$ computed using the FOM satisfies
$\|b - Ax_m\|_2 = |h_{m+1,m}|\,|e_m^T y_m|$.


PROOF
We have, using (3.92) with $k = m$,

$$\begin{aligned}
b - Ax_m &= b - Ax_0 - AQ_m y_m \\
&= \beta q_1 - Q_m H_m y_m - h_{m+1,m}\,(e_m^T y_m)\, q_{m+1} \\
&= \beta q_1 - \beta q_1 - h_{m+1,m}\,(e_m^T y_m)\, q_{m+1}.
\end{aligned}$$

Thus, $\|b - Ax_m\|_2 = |h_{m+1,m}|\,|e_m^T y_m|$.
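
The FOM iterate of (3.95) and the residual identity of Proposition 3.12 can be checked numerically with a short plain-Python sketch. The helper names (`arnoldi`, `solve`, `fom`), the small dense solver, and the test system below are assumptions made for illustration, and the sketch assumes that Arnoldi's algorithm does not break down before step $m$.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def arnoldi(A, r0, m):
    """Arnoldi process started from r0; returns Q, H and beta = ||r0||_2."""
    beta = math.sqrt(dot(r0, r0))
    Q = [[x / beta for x in r0]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        z = matvec(A, Q[j])
        for i in range(j + 1):
            H[i][j] = dot(Q[i], z)
            z = [zi - H[i][j] * qi for zi, qi in zip(z, Q[i])]
        H[j + 1][j] = math.sqrt(dot(z, z))
        if H[j + 1][j] < 1e-14:
            break
        Q.append([zi / H[j + 1][j] for zi in z])
    return Q, H, beta

def solve(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fom(A, b, x0, m):
    """Return x_m = x_0 + Q_m y_m with H_m y_m = beta e_1, plus y, Q, H."""
    r0 = [bi - ai for bi, ai in zip(b, matvec(A, x0))]
    Q, H, beta = arnoldi(A, r0, m)
    Hm = [row[:m] for row in H[:m]]
    y = solve(Hm, [beta] + [0.0] * (m - 1))
    x = [x0[t] + sum(y[j] * Q[j][t] for j in range(m)) for t in range(len(x0))]
    return x, y, Q, H

# Illustrative system (not from the text).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, y, Q, H = fom(A, b, [0.0, 0.0, 0.0], 2)
res = [bi - ai for bi, ai in zip(b, matvec(A, x))]
res_norm = math.sqrt(dot(res, res))
predicted = abs(H[2][1]) * abs(y[1])   # |h_{m+1,m}| |e_m^T y_m| with m = 2
```

In exact arithmetic `res_norm` and `predicted` coincide, and `res` is orthogonal to the first two Arnoldi vectors; in floating point they should agree to roughly roundoff level for a small, well-conditioned example such as this one.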


REMARK 3.52 The conjugate gradient method (for symmetric positive
definite linear systems) is an orthogonal projection technique onto the
Krylov subspace $K_m(A, r_0)$, where $r_0$ is the initial residual. It is therefore
mathematically equivalent to FOM. However, because $A$ is symmetric, some
simplifications resulting from the three-term Lanczos recurrence lead to a
more elegant algorithm.


3.5 The Singular Value Decomposition 


The singular value decomposition, which we will abbreviate “SVD,” is not 
always the most efficient way of analyzing a linear system, but is extremely 
flexible, and is sometimes used in signal processing (smoothing), sensitivity 
analysis, statistical analysis, etc., especially if a large amount of information 
about the numerical properties of the system is desired. The major libraries 
for programmers (e.g., LAPACK) and software systems (e.g., MATLAB, Mathematica)
have facilities for computing the SVD. The SVD is often used in
the context of a QR factorization, but the component matrices in an SVD 
are computed with an iterative technique related to techniques for computing 
eigenvalues and eigenvectors (in Chapter 5 of this book). 

The following theorem defines the SVD. 


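THEOREM 3.26

Let $A \in L(\mathbb{R}^n, \mathbb{R}^m)$ be otherwise arbitrary. Then there are orthogonal matrices
$U$ and $V$ and a matrix $\Sigma = [\Sigma_{ij}] \in L(\mathbb{R}^n, \mathbb{R}^m)$ such that $\Sigma_{ij} = 0$ for $i \ne j$,
$\Sigma_{ii} = \sigma_i \ge 0$ for $1 \le i \le p = \min\{m, n\}$, and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p$, such that

$$A = U\Sigma V^T.$$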


For a proof and further explanation, see G. W. Stewart, Introduction to
Matrix Computations [85] or G. H. Golub²⁴ and C. F. van Loan, Matrix
Computations [34].

Note: The SVD for a particular matrix is not necessarily unique.
Note: The SVD is defined similarly for complex matrices $A \in L(\mathbb{C}^n, \mathbb{C}^m)$.


²⁴ Gene Golub, a famous numerical analyst, a professor of Computer Science and, for many
years, department chairman, at Stanford University, invented the efficient algorithm used
today for computing the singular value decomposition.
REMARK 3.53 A simple algorithm to find the singular-value decomposition
is: (1) find the nonzero eigenvalues of $A^TA$, i.e., $\lambda_i$, $i = 1, 2, \ldots, r$,
(2) find the orthogonal eigenvectors of $A^TA$ and arrange them in the $n \times n$
matrix $V$, (3) form the $m \times n$ matrix $\Sigma$ with diagonal entries $\sigma_i = \sqrt{\lambda_i}$, (4)
let $u_i = \sigma_i^{-1} A v_i$, $i = 1, 2, \ldots, r$, and compute $u_i$, $i = r+1, r+2, \ldots, m$, using
Gram-Schmidt orthogonalization. However, a well-known efficient method
for computing the SVD is the Golub-Reinsch algorithm [86], which employs
Householder bidiagonalization and a variant of the QR method.


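Example 3.18

Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}$. Then

$$U \approx \begin{pmatrix} -0.2298 & 0.8835 & 0.4082 \\ -0.5247 & 0.2408 & -0.8165 \\ -0.8196 & -0.4019 & 0.4082 \end{pmatrix}, \quad
\Sigma \approx \begin{pmatrix} 9.5255 & 0 \\ 0 & 0.5143 \\ 0 & 0 \end{pmatrix}, \quad\text{and}\quad
V \approx \begin{pmatrix} -0.6196 & -0.7849 \\ -0.7849 & 0.6196 \end{pmatrix}$$

is a singular value decomposition of $A$.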
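
The simple procedure of Remark 3.53 can be tried on the matrix of Example 3.18 with a few lines of plain Python. The sketch below is restricted to matrices with two columns, so that the eigenvalues of $A^TA$ come from a quadratic; the function name `svd_two_columns` is an assumption made here, the code assumes full column rank and a nonzero off-diagonal entry in $A^TA$, and it omits the Gram-Schmidt completion of $U$ in step (4).

```python
import math

def svd_two_columns(A):
    """Steps (1)-(4) of the simple algorithm for an m-by-2 matrix A.

    Returns the two singular values, the two right singular vectors,
    and the two corresponding left singular vectors u_i = A v_i / sigma_i.
    """
    # Entries of the symmetric 2-by-2 matrix A^T A.
    g11 = sum(r[0] * r[0] for r in A)
    g12 = sum(r[0] * r[1] for r in A)
    g22 = sum(r[1] * r[1] for r in A)
    # (1) eigenvalues from the characteristic polynomial.
    tr, det = g11 + g22, g11 * g22 - g12 * g12
    disc = math.sqrt(tr * tr - 4.0 * det)
    lams = [(tr + disc) / 2.0, (tr - disc) / 2.0]
    # (2) normalized eigenvectors; (g12, lam - g11) is in the null space of
    # A^T A - lam I, provided g12 is nonzero.
    V = []
    for lam in lams:
        v = [g12, lam - g11]
        nv = math.hypot(v[0], v[1])
        V.append([v[0] / nv, v[1] / nv])
    # (3) singular values, (4) first two left singular vectors.
    sigmas = [math.sqrt(lam) for lam in lams]
    U_cols = [[(r[0] * v[0] + r[1] * v[1]) / s for r in A]
              for v, s in zip(V, sigmas)]
    return sigmas, V, U_cols

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sigmas, V, U_cols = svd_two_columns(A)
```

The computed singular values should agree with the diagonal of $\Sigma$ in Example 3.18 to the digits shown there. The singular vectors may differ from those in the example by a sign, which is consistent with the note that the SVD is not unique.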


Note: If $A = U\Sigma V^T$ represents a singular value decomposition of $A$, then,
for $\tilde{A} = A^T$, $\tilde{A} = V\Sigma^TU^T$ represents a singular value decomposition for $\tilde{A}$.
Let $U(:, i)$ denote the $i$-th column of $U$ and $V(:, i)$ denote the $i$-th column
of $V$. Then $AV(:, i) = \sigma_i U(:, i)$ for $1 \le i \le p$, and, if $n > m$, then $AV(:, i) = 0$
for $p+1 \le i \le n$, that is, $\{V(:, i)\}_{i=p+1}^{n}$ form a basis for the null space of $A$.


DEFINITION 3.43 The vectors $V(:, i)$, $1 \le i \le p$, are called the right
singular vectors of $A$, while the corresponding $U(:, i)$ are called the left singular
vectors of $A$ corresponding to the singular values $\sigma_i$.


The singular values are like eigenvalues, and the singular vectors are like 
eigenvectors. In fact, we have 


THEOREM 3.27 

Suppose $A \in L(\mathbb{R}^n)$ is symmetric and positive definite. Let $\{\lambda_i\}_{i=1}^{n}$ be the
eigenvalues of $A$, ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and let $v_i$ be the
eigenvector corresponding to $\lambda_i$. Furthermore, choose the $v_i$ so $\{v_i\}_{i=1}^{n}$ is an
orthonormal set, and form $V = [v_1, \cdots, v_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_n)$. Then
$A = V\Lambda V^T$ represents a singular value decomposition of $A$.


This theorem follows directly from the definition of the SVD. 
We also have


THEOREM 3.28
Let $A \in L(\mathbb{R}^n)$ be invertible, and let $A = U\Sigma V^T$ represent a singular value
decomposition of $A$. Then the condition number $\kappa_2(A) = \sigma_1/\sigma_n$.


PROOF $\kappa_2(A) = \|A\|_2\|A^{-1}\|_2$, while $\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$. However,

$$\|Ax\|_2 = \|U\Sigma(V^Tx)\|_2 = \|\Sigma w\|_2, \quad\text{where } w = V^Tx,$$

since $\|Uz\|_2 = \|z\|_2$ for every $z \in \mathbb{R}^n$ (because $U$ is orthogonal). Also, since
$V$ is orthogonal, $\|w\|_2 = 1$ if $\|x\|_2 = 1$, and for each $w \in \mathbb{R}^n$, there is a
corresponding $x$, namely $x = Vw$. Thus,

$$\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2 = \max_{\|w\|_2 = 1} \|\Sigma w\|_2 = \sigma_1.$$

Now, observe that

$$A^{-1} = V\Sigma^{-1}U^T,$$

where $\Sigma^{-1} = \mathrm{diag}(1/\sigma_1, \cdots, 1/\sigma_n)$. Since transposes of orthogonal matrices
are orthogonal, a similar argument to that for $\|A\|_2$ shows that $\|A^{-1}\|_2 = 1/\sigma_n$.
Therefore,

$$\kappa_2(A) = \|A\|_2\|A^{-1}\|_2 = \sigma_1(1/\sigma_n).$$


Thus, the condition number of a matrix is obtainable directly from the 
SVD, but the SVD gives us more useful information about the sensitivity of 
solutions than just that single number, as we’ll see shortly. 

The singular value decomposition is related directly to the Moore—Penrose 
pseudo-inverse. In fact, the pseudo-inverse can be defined directly in terms 
of the singular value decomposition. 


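DEFINITION 3.44 Let $A \in L(\mathbb{R}^n, \mathbb{R}^m)$, let $A = U\Sigma V^T$ represent a
singular value decomposition of $A$, and assume $r \le p$ is such that
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$, and $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_p = 0$. Then the
Moore–Penrose pseudo-inverse of $A$ is defined to be

$$A^+ = V\Sigma^+U^T,$$

where $\Sigma^+ = (\Sigma^+_{ij}) \in L(\mathbb{R}^m, \mathbb{R}^n)$ is such that

$$\Sigma^+_{ij} = \begin{cases} 0 & \text{if } i \ne j \text{ or } i > r, \text{ and} \\ 1/\sigma_i & \text{if } 1 \le i = j \le r. \end{cases}$$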
Part of the power of the singular value decomposition comes from the
following.


THEOREM 3.29
Suppose $A \in L(\mathbb{R}^n, \mathbb{R}^m)$ and we wish to find approximate solutions to $Ax = b$,
where $b \in \mathbb{R}^m$. Then,

• If $Ax = b$ is inconsistent, then $x = A^+b$ represents the least squares
solution of minimum 2-norm.

• If $Ax = b$ is consistent (but possibly underdetermined), then $x = A^+b$
represents the solution of minimum 2-norm.

• In general, $x = A^+b$ represents the least squares solution to $Ax = b$ of
minimum norm.


The proof of Theorem 3.29 is left as an exercise (on page 188). 


REMARK 3.54 If $m < n$, one would expect the system to be underdetermined
but full rank. In that case, $A^+b$ gives the solution $x$ such that $\|x\|_2$
is minimum; however, if the system were also inconsistent, then there would be many
least squares solutions, and $A^+b$ would be the least squares solution of minimum
norm. Similarly, if $m > n$, one would expect there to be a single least
squares solution; however, if the rank of $A$ is $r < p = n$, then there would
be many such least squares solutions, and $A^+b$ would be the least squares
solution of minimum norm.


Example 3.19 


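Consider $Ax = b$, where $A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$ and $b = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}$. Then

$$U \approx \begin{pmatrix} -0.2148 & 0.8872 & 0.4082 \\ -0.5206 & 0.2496 & -0.8165 \\ -0.8263 & -0.3879 & 0.4082 \end{pmatrix}, \quad
\Sigma \approx \begin{pmatrix} 16.8481 & 0 & 0 \\ 0 & 1.0684 & 0 \\ 0 & 0 & 0.0000 \end{pmatrix},$$

$$V \approx \begin{pmatrix} -0.4797 & -0.7767 & -0.4082 \\ -0.5724 & -0.0757 & 0.8165 \\ -0.6651 & 0.6253 & -0.4082 \end{pmatrix}, \quad\text{and}\quad
\Sigma^+ \approx \begin{pmatrix} 0.0594 & 0 & 0 \\ 0 & 0.9360 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

Since $\sigma_3 = 0$, we note that the system is not of full rank, so it could be either
inconsistent or underdetermined. We compute $x \approx [0.9444, 0.1111, -0.7222]^T$,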
and we obtain?® || Az — b||z ~ 2.5 x 10-1. Thus, Ax = b, although apparently 


²⁵ The computations in this example were done using MATLAB, and were thus done in IEEE
double precision. The digits displayed here are the results from that computation, rounded
to four significant decimal digits with MATLAB's intrinsic display routines.
underdetermined, is apparently consistent, and $x$ represents that solution of
$Ax = b$ which has minimum 2-norm.


As with other methods for computing solutions, we usually do not form the
pseudo-inverse $A^+$ to compute $A^+b$, but we use the following.


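ALGORITHM 3.9
(Computing $A^+b$)
INPUT:

(a) the $m$ by $n$ matrix $A \in L(\mathbb{R}^n, \mathbb{R}^m)$,

(b) the right-hand-side vector $b \in \mathbb{R}^m$,

(c) a tolerance $\epsilon$ such that a singular value $\sigma_i$ is considered to be equal to
0 if $\sigma_i/\sigma_1 < \epsilon$.

OUTPUT: an approximation $x$ to $A^+b$.

1. Compute the SVD of $A$, that is, compute approximations to $U \in L(\mathbb{R}^m)$,
   $\Sigma \in L(\mathbb{R}^n, \mathbb{R}^m)$, and $V \in L(\mathbb{R}^n)$ such that $A = U\Sigma V^T$.
2. $p \leftarrow \min\{m, n\}$.
3. $r \leftarrow p$.
4. FOR $i = 1$ to $p$
      IF $\sigma_i/\sigma_1 > \epsilon$ THEN
         $\sigma_i^+ \leftarrow 1/\sigma_i$.
      ELSE
         i. $r \leftarrow i - 1$.
         ii. EXIT FOR
      END IF
   END FOR
5. Compute $w = (w_1, \cdots, w_r)^T \in \mathbb{R}^r$, $w \leftarrow U(:, 1\!:\!r)^T b$, where
   $U(:, 1\!:\!r) \in L(\mathbb{R}^r, \mathbb{R}^m)$ is the matrix whose columns are the first $r$ columns of $U$.
6. FOR $i = 1$ to $r$: $w_i \leftarrow \sigma_i^+ w_i$.
7. $x \leftarrow \sum_{i=1}^{r} w_i V(:, i)$.
END ALGORITHM 3.9.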
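
Steps 2 through 7 of Algorithm 3.9 can be sketched in plain Python once the factors of an SVD are available. In the sketch below the function name `pinv_apply` and the list-of-rows storage are assumptions made here, and the factors are simply the four-digit values printed in Example 3.19, so the result can only be expected to match that example to about the same number of digits.

```python
def pinv_apply(U, sigma, V, b, eps):
    """Approximate A^+ b from SVD factors (steps 2-7 of Algorithm 3.9).

    U is m-by-m and V is n-by-n, both stored as lists of rows; sigma is
    the list of the p = min(m, n) singular values in decreasing order.
    """
    m, n = len(U), len(V)
    p = min(m, n)
    r = p
    for i in range(p):                      # step 4 (0-based index)
        if not sigma[i] / sigma[0] > eps:
            r = i                           # keep the first i singular values
            break
    # Steps 5 and 6: w_i = (U(:, i)^T b) / sigma_i for the retained indices.
    w = [sum(U[t][i] * b[t] for t in range(m)) / sigma[i] for i in range(r)]
    # Step 7: x = sum_i w_i V(:, i).
    return [sum(w[i] * V[t][i] for i in range(r)) for t in range(n)]

# Rounded factors as printed in Example 3.19.
U = [[-0.2148, 0.8872, 0.4082],
     [-0.5206, 0.2496, -0.8165],
     [-0.8263, -0.3879, 0.4082]]
sigma = [16.8481, 1.0684, 0.0]
V = [[-0.4797, -0.7767, -0.4082],
     [-0.5724, -0.0757, 0.8165],
     [-0.6651, 0.6253, -0.4082]]
x = pinv_apply(U, sigma, V, [-1.0, 0.0, 1.0], 1e-10)
```

With these inputs the third singular value is discarded, so only the first two columns of $U$ and $V$ enter the computation, and `x` should be close to the vector reported in Example 3.19.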


REMARK 3.55 Ill-conditioning (i.e., sensitivity to roundoff error) in the
computations in Algorithm 3.9 occurs when small singular values $\sigma_i$ are used.
For example, suppose $\sigma_i/\sigma_1 \approx 10^{-6}$, and there is an error $\delta U(:, i)$ in the vector
$b$, that is, $\tilde{b} = b + \delta U(:, i)$ (that is, we perturb $b$ by $\delta$ in the direction of $U(:, i)$).
Then, instead of $A^+b$,


$$A^+(b + \delta U(:, i)) = A^+b + A^+\delta U(:, i) = A^+b + \frac{\delta}{\sigma_i} V(:, i). \tag{3.96}$$

Thus, the norm of the error $\delta U(:, i)$ is magnified by $1/\sigma_i$. Now, if, in addition,
$b$ happened to be in the direction of $U(:, 1)$, that is, $b = \delta_1 U(:, 1)$, then
$\|A^+b\|_2 = \|(\delta_1/\sigma_1)V(:, 1)\|_2 = (1/\sigma_1)\|b\|_2$. Thus, the relative error, in this
case, would be magnified by $\sigma_1/\sigma_i$.


In view of Remark 3.55, we are led to consider modifying the problem
slightly to reduce the sensitivity to roundoff error. For example, suppose
that we are data fitting, with $m$ data points $(t_i, y_i)$ (as in Section 3.3.8 on
page 139), and $A$ is the matrix as in Equation (3.26), where $m \gg n$. Then
we assume there is some error in the right-hand-side vector $b$. However, since
$\{U(:, i)\}$ forms an orthonormal basis for $\mathbb{R}^m$,

$$b = \sum_{i=1}^{m} \beta_i U(:, i) \quad\text{for some coefficients } \{\beta_i\}_{i=1}^{m}.$$

Therefore, $U^Tb = (\beta_1, \ldots, \beta_m)^T$, and we see that $x$ will be more sensitive to
changes in the components $\beta_i$ of $b$ with larger indices. If we
know the order of magnitude of typical errors in the data, then, intuitively, it
makes sense not to use components of $b$ in which the magnification of errors
will be larger than that. That is, it makes sense in such cases to choose the tolerance
$\epsilon$ in Algorithm 3.9 to be of the order of the typical data error.

Use of $\epsilon \ne 0$ in Algorithm 3.9 can be viewed as replacing the smallest
singular values of the matrix $A$ by 0. In the case that $A \in L(\mathbb{R}^n)$ is square and
only $\sigma_n$ is replaced by zero, this amounts to replacing an ill-conditioned matrix
$A$ by a matrix that is exactly singular. One (of many possible) theorems
dealing with this replacement process is


THEOREM 3.30

Suppose $A \in L(\mathbb{R}^n)$, and we replace $\sigma_n \ne 0$ by 0, then form $\tilde{A} = U\tilde{\Sigma}V^T$,
where $A = U\Sigma V^T$ represents the singular value decomposition of $A$, and
$\tilde{\Sigma} = \mathrm{diag}(\sigma_1, \cdots, \sigma_{n-1}, 0)$. Then

$$\|A - \tilde{A}\|_2 = \min_{\substack{B \in L(\mathbb{R}^n) \\ \mathrm{rank}(B) < n}} \|A - B\|_2.$$

(The proof is left as Exercise 59 on page 189.)
Suppose now that $\tilde{A}$ has been obtained from $A$ by replacing the smallest
singular values of $A$ by 0, so the nonzero singular values of $\tilde{A}$ are
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$, and define $x = \tilde{A}^+b$. Then, perturbations of size $\|\Delta b\|$ in $b$
result in perturbations of size at most $(\sigma_1/\sigma_r)\|\Delta b\|$ in $x$. This prompts us to
define a generalization of condition number as follows.


DEFINITION 3.45 Let $A \in L(\mathbb{R}^m, \mathbb{R}^n)$, with $m$ and $n$ arbitrary, and
assume the nonzero singular values of $A$ are $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. Then
the generalized condition number of $A$ is $\sigma_1/\sigma_r$.
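
As a small illustration of Definition 3.45, the generalized condition number can be computed from a list of singular values as in the plain-Python sketch below. The function name and the relative threshold `eps` (mirroring the tolerance of Algorithm 3.9) are assumptions made here; the sketch assumes at least one singular value is nonzero.

```python
def generalized_condition_number(sigmas, eps=0.0):
    """Return sigma_1 / sigma_r, where sigma_r is the smallest singular
    value whose ratio to sigma_1 exceeds eps; smaller values count as 0."""
    s1 = max(sigmas)
    kept = [s for s in sigmas if s / s1 > eps]
    return s1 / min(kept)

# Singular values as printed in Example 3.19 (the third is zero).
kappa = generalized_condition_number([16.8481, 1.0684, 0.0])
```

For the matrix of Example 3.19, whose ordinary condition number is not defined because the matrix is singular, this gives the ratio of the two nonzero singular values.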


3.6 Exercises


1. Show that the inverse of a positive definite matrix is positive definite.

2. Prove that the diagonal elements of a symmetric positive definite $n \times n$
matrix are positive.

3. A matrix $A = (a_{ij})$ of size $n \times n$ is said to be skew-symmetric if $A^T = -A$.
Prove the following properties of a skew-symmetric matrix.

(a) $a_{ii} = 0$ for $i = 1, \ldots, n$.

(b) $I - A$ is nonsingular, where $I$ is the $n \times n$ identity matrix.

4. Define an $n \times n$ real matrix $A$ by $a_{ij} = w_i^T w_j$ for $n$ real vectors
$w_1, w_2, \ldots, w_n$. Prove that matrix $A$ is real, symmetric, and positive
semi-definite.

5. Show that (a), (b), and (c) in Definition 3.19 are satisfied for the second
inner product in Example 3.5 on page 92.

6. Prove that, if $A$ is a normal matrix, i.e., $A^HA = AA^H$, then $\|A\|_2 = \rho(A)$.
(Note that if $A$ is real symmetric or Hermitian, then $A$ is normal.)

7. Let $A$ be a nonsingular $n \times n$ real matrix and $\|A^{-1}B\| = r < 1$, where
the matrix norm is induced from some vector norm.

(a) Show that $A + B$ is nonsingular and $\|(A + B)^{-1}\| \le \dfrac{\|A^{-1}\|}{1 - r}$.

(b) Show that $\|(A + B)^{-1} - A^{-1}\| \le \dfrac{\|B\|\,\|A^{-1}\|^2}{1 - r}$.

8. Find nonzero constants:

(a) $d_1$ and $d_2$ such that $d_1\|x\|_1 \le \|x\|_2 \le d_2\|x\|_1$ for all $x \in \mathbb{R}^n$. Find
$d_1$ and $d_2$ so each inequality is sharp for some $x$.
(b) $b_1$ and $b_2$ such that $b_1\|x\|_1 \le \|x\|_\infty \le b_2\|x\|_1$ for all $x \in \mathbb{R}^n$. Find
$b_1$ and $b_2$ so each inequality is sharp for some $x$.

(c) Compare with Proposition 3.2.

9. Let $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ be two vector norms for $\mathbb{R}^n$. Suppose that
$\lim_{k \to \infty} \|x - x_k\|_\beta = 0$. Prove that given $\epsilon > 0$, there is an $N$ such that
$\|x - x_k\|_\alpha < \epsilon$ when $k > N$.

10. For an $n \times n$ matrix $A$, show that $\|A\|_1 \le n^{1/2}\|A\|_2 \le n\|A\|_\infty$.

11. Let


Find $\|A\|_1$, $\|A\|_\infty$, $\|A\|_2$, and $\rho(A)$. Verify that $\rho(A) \le \|A\|_1$, $\rho(A) \le \|A\|_\infty$,
and $\rho(A) \le \|A\|_2$.


12. Suppose that matrix $A$ is nonsingular, $x$ is the solution to $Ax = b$,
suppose ||A~*||2 = 10°, and suppose ||A||2 = 107. We wish to solve 
Bz =b where B= A—C and ||C||z = 107+. 

(a) Prove that $B$ is nonsingular.

(b) Find an upper bound on $\|x - z\|_2$ in terms of $\|x\|_2$, that is, find a
constant $c > 0$ such that $\|x - z\|_2 \le c\,\|x\|_2$.

13. Let $A$ be the $n \times n$ tridiagonal matrix with

$$a_{ij} = \begin{cases} 5 & \text{if } i = j, \\ 1 & \text{if } i = j+1 \text{ or } i = j-1, \\ 0 & \text{otherwise.} \end{cases}$$

(a) Show that $A$ is nonsingular.

(b) Show that $\dfrac{1}{7} \le \|A^{-1}\|_\infty \le \dfrac{1}{3}$.

14. Suppose $\|I - AB_0\| = c < 1$ and

$$B_k = B_{k-1} + B_{k-1}(I - AB_{k-1}), \quad k = 1, 2, \ldots$$

(a) Show that $\|I - AB_k\| \le c^{2^k}$.

(b) Show that $\|A^{-1} - B_k\| \le \|B_0\|\,\dfrac{c^{2^k}}{1 - c}$.

15. Prove that if $\lambda$ is an eigenvalue of $A$ and $c$ is a constant, then $1 - \lambda c$ is
an eigenvalue of $I - cA$.
16. Using Exercise 15, prove that if $\rho(A) < 1$, then $I - A$ is nonsingular.
Also show that

$$(I - A)^{-1}(I - A^{N+1}) = \sum_{i=0}^{N} A^i = I + A + A^2 + \cdots + A^N.$$

17. Show that back solving for Gaussian elimination (that is, show that
completion of Algorithm 3.2) requires $(n^2 + n)/2$ multiplications and
divisions and $(n^2 - n)/2$ additions and subtractions.

18. Show that performing the forward phase of Gaussian elimination for
$Ax = b$ (that is, completing Algorithm 3.1) requires $\frac{1}{3}n^3 + O(n^2)$
multiplications and divisions.

19. Show that the inverse of a nonsingular lower triangular matrix is lower
triangular.

20. Prove Equation (3.12) on page 109.

Explain why A+) = MA and b(+) = M@b@, where A is 
as in Equation (3.10) on page 107 and M“” is as in Equation (3.11) on 
page 109. 


22. Explain why $A = LU$, where $L$ and $U$ are as in Equation (3.14) on
page 109.

23. Show that computing the inverse of a matrix $A \in L(\mathbb{R}^n)$ as in the note
on page 110 requires $n^3 + O(n^2)$ multiplications and divisions.

Hint: To achieve $n^3 + O(n^2)$ multiplications and divisions, you need
to take advantage of the fact that the right-hand sides of the systems
you are trying to solve are the unit vectors $e_i$; otherwise, the number of
multiplications and divisions is $(4/3)n^3 + O(n^2)$.

24. Prove Proposition 3.10 on page 121. (The exact operation count you
get may vary, because certain details of your algorithm, such as taking
account of zeros in computing the inverses of the $m$ by $m$ blocks, may
vary.)

25. Show that, if $U$ is unitary (that is, if $U^HU = I$), then $\|Ux\|_2 = \|x\|_2$
for every $x$.

26. Show that, if $U$ is unitary, then $\kappa_2(U) = 1$.

Hint: $\kappa_2(A)$ is the condition number of the matrix $A$ with respect to
the 2-norm; see (3.21) on page 122. You may wish to use the result of
Exercise 25.

27. Let $A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{pmatrix}$ and $b = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}$.

(a) Compute $\kappa(A)$ approximately.
(b) Use floating point arithmetic with $\beta = 10$ and $t = 3$ (3-digit decimal
arithmetic), rounding-to-nearest, and Algorithms 3.1 and 3.2
to find an approximation to the solution $x$ to $Ax = b$.

(c) Execute Algorithm 3.5 by hand, using $t = 3$, $\beta = 10$, and outwardly
rounded interval arithmetic (and rounding-to-nearest for
computing $Y$).

(d) Find the exact solution to $Ax = b$ by hand.

(e) Compare the results you have obtained.


28. Derive the normal equations (3.29) from (3.28).


29. Let A= and b= 


a ee 


1 
4 
5 
8 


WwNnNr © 


(a) Compute a QR factorization of $A$ using Householder transformations.
(You may use a tool such as MATLAB's, but show each step.)

(b) Compute a QR factorization of $A$ using Givens rotations, showing
each step.

(c) Compute a QR factorization using some preprogrammed routine
(such as MATLAB's "qr" function). Are the three QR factorizations
you have obtained the same?

(d) Use one of the QR factorizations to compute the least squares
solution to $Ax = b$.

(e) If the three QR factorizations are not the same, then compute
the least squares solution to $Ax = b$ using each of the distinct
factorizations. Are the answers that you get the same, to within
roundoff error?


30. Let

$$A = \begin{pmatrix} 2 & 1 & 1 \\ 4 & 4 & 1 \\ 6 & -5 & 8 \end{pmatrix}.$$

(a) Find the LU factorization of $A$, such that $L$ is lower triangular and
$U$ is unit upper triangular.

(b) Perform back solving then forward solving to find a solution $x$ for
the system of equations $Ax = b = [4 \;\; 7 \;\; 15]^T$.


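31. Let $A \in \mathbb{R}^{(p+q) \times (p+q)}$ be a nonsingular matrix given by

$$A = \begin{pmatrix} C & B^T \\ B & 0 \end{pmatrix},$$

where $C \in \mathbb{R}^{p \times p}$ and $B \in \mathbb{R}^{q \times p}$. Assume $A$ has an LU decomposition
with

$$L = \begin{pmatrix} I & 0 \\ X & I \end{pmatrix} \quad\text{and}\quad U = \begin{pmatrix} C & Y \\ 0 & Z \end{pmatrix}.$$

Find $X$, $Y$, and $Z$.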


32. Let the n x n matrix A have elements 


1 
aij =} e” ef” dx 
0 


for 1 <i,7 <n. Prove that A has a Cholesky factorization A = a i 


33. Find the Cholesky factorization of

$$A = \begin{pmatrix} 1 & -1 & 2 \\ -1 & 5 & 4 \\ 2 & 4 & 29 \end{pmatrix}.$$

Also explain why $A$ is positive definite.


34. Let $A$ be an invertible matrix. Suppose $A, \Delta A \in \mathbb{R}^{n \times n}$ and $b, \Delta b \in \mathbb{R}^n$
are such that $Ax = b$ and $(A + \Delta A)y = b + \Delta b$. Further assume that

(i) $\|\Delta A\| \le \delta\,\|A\|$,
(ii) $\|\Delta b\| \le \delta\,\|b\|$, and
(iii) $\delta\,\kappa(A) = r < 1$, where $\kappa(A)$ is the condition number of $A$.

Then

(a) Show that $A + \Delta A$ is nonsingular.

(b) Prove that $\dfrac{\|y\|}{\|x\|} \le \dfrac{1 + r}{1 - r}$.
O.la O.la 


35. Let A= ( 10 15 ) . Determine a such that «(A) is minimized. Use 
the maximum norm. 
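36. Let $A$ be the $n \times n$ lower triangular matrix with elements

$$a_{ij} = \begin{cases} 1 & \text{if } i = j, \\ -1 & \text{if } i = j+1, \\ 0 & \text{otherwise.} \end{cases}$$

Determine the condition number of $A$ using the matrix norm $\|\cdot\|_\infty$.


37. Let $A = I - xx^T$, where $x \in \mathbb{R}^n$ and $\|x\|_2 = \sqrt{2}$. Prove that the condition
number $\kappa_2(A) = \|A^{-1}\|_2\|A\|_2 = 1$ (i.e., $A$ is perfectly conditioned).

38. Consider solving $Ax = b$ by decomposing

$$A = \begin{pmatrix} a_1 & c_1 & 0 \\ b_2 & a_2 & c_2 \\ 0 & b_3 & a_3 \end{pmatrix}
= \begin{pmatrix} \alpha_1 & 0 & 0 \\ b_2 & \alpha_2 & 0 \\ 0 & b_3 & \alpha_3 \end{pmatrix}
\begin{pmatrix} 1 & \gamma_1 & 0 \\ 0 & 1 & \gamma_2 \\ 0 & 0 & 1 \end{pmatrix}.$$

Let $|a_1| > |c_1| > 0$, $|a_2| > |b_2| + |c_2|$, and $|a_3| > |b_3|$. Show that

(a) $\alpha_i \ne 0$ for $i = 1, 2, 3$.
(b) $|\gamma_i| < 1$ for $i = 1, 2$.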


39. Consider the matrix system Au = b given by 


+ 0 0 O0*/ fu 1 
+ $ 0 Of | w 0 
1 1 1 ollas| Jo 
ie 3 a 3/ \u : 


(a) Determine $A^{-1}$ by hand.
(b) Determine the infinity-norm condition number of the matrix $A$.

(c) Let $\tilde{u}$ be the solution when the right-hand side vector $b$ is perturbed
to $\tilde{b} = (1.01 \;\; 0 \;\; 0 \;\; 0.99)^T$. Estimate $\|u - \tilde{u}\|_\infty$, without computing
$\tilde{u}$.

40. Let $x = \begin{pmatrix} 3 \\ 0 \\ 4 \end{pmatrix}$. Find a Householder matrix $H$ such that $Hx = \begin{pmatrix} -5 \\ 0 \\ 0 \end{pmatrix}$.

41. Consider solving the matrix system $Ax = b$ by first factoring $A$ into
$LU$ via the Doolittle Algorithm, then solving $Ly = b$ and $Ux = y$. The
Doolittle LU algorithm is given by:

input $n$, $(a_{ij})$
for $k = 1$ to $n$ do
    $l_{kk} = 1$
    for $j = k$ to $n$ do
        $u_{kj} = a_{kj} - \sum_{s=1}^{k-1} l_{ks} u_{sj}$
    end do
    for $i = k+1$ to $n$ do
        $l_{ik} = \left( a_{ik} - \sum_{s=1}^{k-1} l_{is} u_{sk} \right) / u_{kk}$
    end do
end do
output $(l_{ij})$, $(u_{ij})$

(a) Show that LU factorization algorithm given above requires


additions and subtractions.

(b) Show that solving $Ly = b$, where $L$ is lower-triangular with $l_{ii} = 1$
for all $i$, requires $n^2/2 - n/2$ multiplications and divisions and
$n^2/2 - n/2$ additions and subtractions.

(c) Show that solving $Ux = y$, where $U$ is upper-triangular, requires
$n^2/2 + n/2$ multiplications and divisions and $n^2/2 - n/2$ additions
and subtractions.

(d) Add all operation counts in parts (a), (b), (c) and compare with
Gaussian Elimination.

42. For the system as in Example 3.14 (on page 153), 


(a) show that p(J) <1, but p(G) =1; 


(b) try x = (1,1,1)7 in the Gauss-Seidel method, and see what 
happens. 


43. Complete the computations, to check that 2 is as given in Exam- 
ple 3.15 on page 155. 


44, Use Lemma 3.11 to verify the statements in Example 3.16 (on page 157). 
45. Verify Equation (3.88) on page 169. 


46. Let A be the point matrix in Example 3.9 (on page 129), and let b 
be the point right-hand-side vector. Apply the interval Gauss-Seidel 
method (Equation (3.58)) to Ax = b, starting with °°) = [-10, 10], 


1 <i < 8, using the inverse midpoint preconditioner, using double 


precision IEEE arithmetic and outward rounding,²⁶ and iterating until
the result is apparently stationary. Have you proven that the system in
Example 3.9 has a solution? If so, then what rigorous bounds on that
solution do you get?

47. Repeat Exercise 46, but with interval Gaussian elimination (as in §3.3.7
starting on page 130) instead of the interval Gauss-Seidel method. Compare
the results.

48. Let $A$ be the $n \times n$ tridiagonal matrix with

$$a_{ij} = \begin{cases} 4 & \text{if } i = j, \\ -1 & \text{if } i = j+1 \text{ or } i = j-1, \\ 0 & \text{otherwise.} \end{cases}$$

Prove that the Gauss-Seidel and Jacobi methods converge for this matrix.

49. Prove that if the matrix $A = M - N$ is singular and $M$ is nonsingular,
then $\|M^{-1}N\| \ge 1$, where $\|\cdot\|$ is any induced matrix norm.

50. Prove that if the $2 \times 2$ matrix $A$ is positive definite, then the Jacobi
method converges for a linear system $Ax = b$. (You prove that the
Jacobi method converges for $2 \times 2$ positive definite matrices. This does
not contradict Example 3.12 or the note preceding this example.)

51. Let $A = D - U - L$, where $A$ is strictly diagonally dominant, $D$ is
diagonal, $U$ is upper triangular and $L$ is lower triangular. Furthermore,
assume that $D$, $U$, and $L$ have all nonnegative elements, that
is, $D, U, L \ge 0$. Suppose $b \ge 0$. Consider the Gauss-Seidel iterative
procedure

$$x^{(m+1)} = (D - L)^{-1}Ux^{(m)} + (D - L)^{-1}b$$

for $m = 0, 1, 2, \ldots$, with $x^{(0)} = 0$. Assume also that the spectral radius
obeys

$$\rho((D - L)^{-1}U) < 1.$$

Prove that $x^{(m)} \to x$, where all the elements of $x$ are nonnegative, that
is, $x \ge 0$. (Hint: Show that $x^{(m)} \ge 0$ for each $m$.)

52. Consider the linear system $Ax = b$, where $A = L + D + U$, $L$ is strictly
lower triangular, $D$ is diagonal, and $U$ is strictly upper triangular. The
SOR iterative method has the form

$$x^{(k+1)} = T_\sigma x^{(k)} + c,$$


²⁶ such as in INTLAB
where

$$c = \left( L + \frac{1}{\sigma} D \right)^{-1} b$$

and

$$T_\sigma = (\sigma L + D)^{-1}[(1 - \sigma)D - \sigma U].$$


Prove that the SOR method with o = 1 is not convergent, but the SOR 
method with ¢ = 4 is convergent. 


53. Suppose that $A$ is irreducibly diagonally dominant and $A = L + D + U$
with $D = \alpha I$. Let $\lambda$ be an eigenvalue of $A$. Prove that $|\lambda/\alpha - 1| < 1$.


54. Consider the linear system

$$\begin{pmatrix} 3 & 2 \\ 2 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 7 \\ 10 \end{pmatrix}.$$

Using the starting vector $x = (0, 0)^T$, carry out two iterations of the
Conjugate Gradient Method to solve the system.


55. Consider the two iterative methods considered in Section 3.4.11. Let
1 2 4.75 —1.875 
a cc mpg @e 0.90 ) 
(a) Find ||I — AF||oo. 


(b) Next, find t that will guarantee that ||v;— A71||o. < 1078 for each 
of the iteration methods described in section 3.4.11. 


56. Suppose that $y_1$ and $y_2$ are linearly independent eigenvectors of the $n \times n$
matrix $A$ with eigenvalues $\lambda_1$ and $\lambda_2$ with $\lambda_1 \ne \lambda_2$. Suppose that
$r_0 = b - Ax_0$ satisfies $r_0 = c_1y_1 + c_2y_2$. Let $q_1 = r_0/\|r_0\|$ in Arnoldi's
method.

(a) Show that the Arnoldi algorithm stops at $k = 2$.

(b) Let $x_2 = x_0 + w$, where $w$ belongs to the Krylov subspace $K_2(A, r_0)$.
Show that $Ax_2 = b$. (Note that $b - Ax_2 \perp K_2(A, r_0)$.)


57. Prove Theorem 3.29 on page 177. (Hint: You may need to consider
various cases. In any case, you'll probably want to use the properties of
orthogonal matrices, as in the proof of Theorem 3.28.)


58. Given $U$, $\Sigma$, and $V$ as given in Example 3.19, compute $A^+b$ by using
Algorithm 3.9. How does the $x$ that you obtain compare with the $x$
reported in Example 3.19?
59. Prove Theorem 3.30 (on page 179).


60. Suppose $A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{pmatrix}$ can be written $A = U\Sigma V^T$, where $U$ and $V$
are orthogonal,

$$U \approx \begin{pmatrix} 0.2093 & 0.9644 & 0.1617 \\ 0.5038 & 0.0353 & -0.8631 \\ 0.8380 & -0.2621 & 0.4785 \end{pmatrix},$$

$$\Sigma \approx \begin{pmatrix} 17.4125 & 0 & 0 \\ 0 & 0.8752 & 0 \\ 0 & 0 & 0.1969 \end{pmatrix}, \quad\text{and}$$

$$V \approx \begin{pmatrix} 0.4647 & -0.8333 & 0.2995 \\ 0.5538 & 0.0095 & -0.8326 \\ 0.6910 & 0.5528 & 0.4659 \end{pmatrix}.$$

Suppose we want to solve the system $Ax = b$, where $b = [1, -1, 1]^T$, but
that, due to noise in the data, we do not wish to deal with any system of
equations with condition number equal to 25 or greater. Use the above
singular value decomposition to write down the solution of minimum
norm to a rank-two system of equations nearest to $Ax = b$, such that
the computations proceed with a matrix whose condition number is less
than 25. Explain what we mean by "nearest" here.


61. Find the singular value decomposition of the matrix $A = \begin{pmatrix} 1 & 2 \\ 1 & 1 \\ 1 & 3 \end{pmatrix}$.


62. Consider the singular value decomposition of the matrix $A \in \mathbb{R}^{m \times n}$
given by $A = PDQ$, where $P$ is an $m \times m$ unitary matrix, $D$ is an
$m \times n$ diagonal matrix, and $Q$ is an $n \times n$ unitary matrix. Show that
$\|A(A^TA)^{-1}A^T\|_2 = 1$, assuming that $A^TA$ is nonsingular. (Note that
$D$ is not a square matrix.)


Chapter 4 


Approximation Theory 


4.1 Introduction 


In this chapter, we consider approximation of, for example, $f \in C[a, b]$,
where $C[a, b]$ represents the space of continuous functions on the interval $[a, b]$.
We are interested in approximating $f(x)$ by an elementary function $p(x)$ on
the interval $[a, b]$. For example, $p(x)$ could be a polynomial of degree $n$, a
continuous piecewise linear function, a trigonometric polynomial, a rational
function, or a linear combination of "nice" functions, that is, functions that
are easy to use in numerical computation.

We study approximation of functions in the setting of normed vector spaces. 
We begin with a review of normed linear spaces, inner products, projections, 
and orthogonalization. 


4.2 Norms, Projections, Inner Product Spaces, and Or- 
thogonalization in Function Spaces 


Recall the basic properties of normed vector spaces, which we review in 
Section 3.2 on page 88. Here, we will be using norms in the context of vector 
spaces without a finite-dimensional basis. 


Example 4.1
Here, $V$ is our vector space.


(a) $V = C[a, b]$. Two common norms are:

(a1) $\|u\|_\infty = \max_{a \le x \le b} \left( |u(x)|\,\rho(x) \right)$,

where $\rho(x) > 0$ on $[a, b]$ and $\rho \in C[a, b]$. This is called the Chebyshev,
uniform, or max norm with weight function $\rho(x)$.

(a2) $\|u\|_2 = \left( \int_a^b |u(x)|^2 \rho(x)\,dx \right)^{1/2}$,

where $\rho(x) > 0$, $a < x < b$, $\rho \in C(a, b)$, and

$$\int_a^b \rho(x)\,dx < \infty.$$

This is called the $L^2$-norm or least squares norm with weight function
$\rho(x)$.


(b) If $V = \mathbb{R}^n$, then $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_\infty$ are norms for $\mathbb{R}^n$. (See Section 3.2
on page 88 and the subsequent pages.)


4.2.1 Best Approximations 


Now let $W$ be a finite-dimensional subspace of $V$. A typical problem in
approximation theory is: Given $v \in V$, find $w \in W$ such that the distance
$\|v - w\|$ is least among all $w \in W$. Such a $w$ is called a best approximation in
W to v with respect to norm || - ||. (For example, V = C[a, }], || - || = || - ||, 
and W = {set of polynomials of degree < n}.) 

Question: Does such a w exist? We will prove that the answer to this is yes. 


THEOREM 4.1
(Existence of a best approximation) Let $W$ be an $(n+1)$-dimensional subspace
of a normed linear space $V$. Let $u_0, u_1, \ldots, u_n$ be linearly independent
elements of $W$. (Thus, $W = \mathrm{span}(u_0, u_1, u_2, \ldots, u_n)$.) Then there is a $p \in W$,
i.e., $p = \sum_{j=0}^{n} \alpha_j u_j$, for a given $f \in V$, such that

$$\|f - p\| = \Bigl\| f - \sum_{j=0}^{n} \alpha_j u_j \Bigr\| = \min_{\gamma_0, \gamma_1, \ldots, \gamma_n} \Bigl\| f - \sum_{j=0}^{n} \gamma_j u_j \Bigr\|,$$

that is, $\|f - p\| \le \|f - q\|$ for all $q \in W$. ($p$ is the best approximation to $f \in V$
with respect to norm $\|\cdot\|$.) This is illustrated schematically in Figure 4.1.


PROOF The proof is divided into two cases.


Case 1: Suppose that $f$ is linearly dependent on $u_0, u_1, \ldots, u_n$. Then

$$f = \sum_{j=0}^{n} \alpha_j u_j = p,$$

and $\|f - p\| = 0$. That is, $f \in W$, so $p = f$.

FIGURE 4.1: $p$ is the best approximation to $f \in V$.

Case 2: Suppose that $f \notin W$, so $f$ is linearly independent of $u_0, u_1, \ldots, u_n$.
Let

$$\mathcal{L} = \Bigl\{ w \in V : w = \sum_{j=0}^{n} z_j u_j + z_{n+1} f \Bigr\},$$

where $z_j \in \mathbb{R}$ for $j = 0, 1, \ldots, n+1$. Then $\mathcal{L}$ is an $(n+2)$-dimensional subspace
of $V$. Let $z = (z_0, z_1, \ldots, z_{n+1})^T \in \mathbb{R}^{n+2}$, where $z_0, z_1, \ldots, z_{n+1}$ are given.
Let $\|\cdot\|_*$ be defined on $\mathbb{R}^{n+2}$ by

$$\|z\|_* = \Bigl\| \sum_{j=0}^{n} z_j u_j + z_{n+1} f \Bigr\|.$$

Then,

1. $\|z\|_* \ge 0$ and $\|z\|_* = 0$ if and only if $z = 0$, since $u_0, u_1, \ldots, u_n, f$ are
independent,

2. $\|\lambda z\|_* = |\lambda|\,\|z\|_*$, and

3. $\|z_1 + z_2\|_* \le \|z_1\|_* + \|z_2\|_*$,

because $\|\cdot\|$ has these properties. Thus, $\|\cdot\|_*$ is a norm on $\mathbb{R}^{n+2}$, i.e.,
$\|\cdot\|_* : \mathbb{R}^{n+2} \to \mathbb{R}$. Now, define $p$ to be such that

$$\|f - p\| = \Bigl\| f - \sum_{j=0}^{n} \alpha_j u_j \Bigr\| = \min_{\gamma_0, \gamma_1, \ldots, \gamma_n} \Bigl\| f - \sum_{j=0}^{n} \gamma_j u_j \Bigr\|,$$

assuming that $p$ exists. But

$$\|f - p\| = \Bigl\| f - \sum_{j=0}^{n} \alpha_j u_j \Bigr\| = \|e - a\|_*,$$

where $e = (0, 0, \ldots, 0, 1)^T$ and $a = (\alpha_0, \alpha_1, \ldots, \alpha_n, 0)^T$. Clearly,

$$\|e - a\|_* = \min_{z \in G} \|e - z\|_*,$$

where $G = \{z \in \mathbb{R}^{n+2} : z = (z_0, z_1, \ldots, z_n, 0)^T\}$. Thus, if $a \in \mathbb{R}^{n+2}$ exists
such that $\|e - a\|_* = \min_{z \in G} \|e - z\|_*$, then

$$p = \sum_{j=0}^{n} \alpha_j u_j \in V$$

exists. (We have reduced the problem of finding $p \in W$ to finding $a \in \mathbb{R}^{n+2}$.)
Question: Can we find $a$?
Define $H = \{z \in G : \|z - e\|_* \le \|e\|_*\}$. Then,

(a) $H$ is not empty, since $0 = (0, 0, \ldots, 0)^T \in H$.
(b) $H$ is bounded in $\mathbb{R}^{n+2}$, since if $z \in H$, then

$$\|z\|_* \le \|z - e\|_* + \|e\|_* \le 2\|e\|_*.$$

Therefore, $\|e - z\|_*$ has an infimum¹ $\mu = \inf_{z \in H} \|e - z\|_*$. Let

$$z^{(i)} = (z_0^{(i)}, z_1^{(i)}, \ldots, z_n^{(i)}, 0)^T \in H \subset \mathbb{R}^{n+2}$$

be a sequence of vectors such that $\|e - z^{(i)}\|_* \to \mu$ as $i \to \infty$. By the Bolzano-
Weierstrass Theorem (i.e., every bounded sequence in $\mathbb{R}^m$ has a convergent
subsequence), a subsequence of $\{z^{(i)}\}$ has a limit point $\hat{z}$. Thus, $a = \hat{z}$. This,
in turn, implies existence of $p \in W$.


Example 4.2 
Find the best approximation $p \in P^0$ to $f(x) = e^x \in C[0,1]$ for the $\|\cdot\|_\infty$ and $\|\cdot\|_2$ norms. (Thus, $W = P^0 \subset V = C[0,1]$.)

(a) Find $p$ that minimizes $\|e^x - p\|_\infty = \max_{0 \le x \le 1} |e^x - p|$.

In this case, $p = \tfrac{1}{2}(e + 1)$ and $\max_{0 \le x \le 1} |e^x - p| = \tfrac{1}{2}(e - 1)$.

(b) Find $p$ that minimizes $\|e^x - p\|_2 = \left( \int_0^1 (e^x - p)^2\,dx \right)^{1/2}$.

In this case, $p = e - 1$ and $\|e^x - p\|_2 = \left( \tfrac{1}{2}(4e - e^2 - 3) \right)^{1/2}$.
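The two constants in Example 4.2 can be checked numerically. The following is only a sketch under stated assumptions: the grid sizes, the candidate range $[1, e]$, and the helper names `max_error` and `l2_error` are illustrative choices, not part of the text.

```python
import numpy as np

# Dense grid on [0, 1]; the grid size is an arbitrary choice for this sketch.
x = np.linspace(0.0, 1.0, 20001)
f = np.exp(x)

def max_error(p):
    """Discrete stand-in for the uniform norm of e^x - p on [0, 1]."""
    return np.max(np.abs(f - p))

def l2_error(p):
    """Trapezoidal approximation of the L2 norm of e^x - p on [0, 1]."""
    g = (f - p) ** 2
    h = x[1] - x[0]
    return np.sqrt(h * (np.sum(g) - 0.5 * (g[0] + g[-1])))

p_inf = 0.5 * (np.e + 1.0)   # constant claimed to be best in the max norm
p_two = np.e - 1.0           # constant claimed to be best in the L2 norm

# Brute-force search over candidate constants (hypothetical search range).
candidates = np.linspace(1.0, np.e, 2001)
best_inf = candidates[np.argmin([max_error(p) for p in candidates])]
best_two = candidates[np.argmin([l2_error(p) for p in candidates])]

print(best_inf, p_inf)
print(best_two, p_two)
```

The brute-force minimizers should agree with the closed-form constants up to the spacing of the candidate grid.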


¹ A fundamental property of the real numbers is that every nonempty set that is bounded below has a greatest lower bound, or infimum.
We will now see that approximation in inner product spaces is straightforward. We introduced the concept of inner product spaces in the context of matrix computations (i.e., finite-dimensional vector spaces) in Definition 3.19 on page 91. We review this same concept here, in the more general context of function spaces.


DEFINITION 4.1 A real vector space $V$ is a real inner product space if for each $u, v \in V$, a real number $(u,v)$ can be defined with the properties:

(i) $(u,u) \ge 0$ for $u \in V$, with $(u,u) = 0$ if and only if $u = 0$,
(ii) $(u,v) = (v,u)$, and
(iii) $(\alpha u + \beta v, w) = \alpha(u,w) + \beta(v,w)$ for all $u, v, w \in V$ and $\alpha, \beta \in \mathbb{R}$.


Example 4.3
(of inner product spaces)

(1) $V = C[a,b]$ with $(f,g) = \int_a^b \rho(x) f(x) g(x)\,dx$, where $\rho(x) > 0$ for $a < x < b$, and $\rho \in C[a,b]$.

(2) $\mathbb{R}^n$ with $(x,y) = x^T y$.


Unless we specify otherwise, when we say “inner product space” or “normed 
linear space” in the remainder of this chapter, we will mean “real inner product 
space.” Much of what is presented is also true in spaces over the complex 
numbers. 


REMARK 4.1 Complex inner product spaces can be defined analogously, with the following modifications to properties (ii) and (iii) of Definition 4.1:

(ii)' $(u,v) = \overline{(v,u)}$,
(iii)' $(\alpha u + \beta v, w) = \bar{\alpha}(u,w) + \bar{\beta}(v,w)$ for all $u, v, w \in V$ and $\alpha, \beta \in \mathbb{C}$.

Complex inner product spaces corresponding to Example 4.3 are

(1) $V = \{ f : [a,b] \to \mathbb{C},\ f \text{ continuous} \}$, with $(f,g) = \int_a^b \rho(x)\,\overline{f(x)}\,g(x)\,dx$, where $\rho(x) > 0$ for $a < x < b$, and $\rho \in C[a,b]$.

(2) $\mathbb{C}^n$ with $(z,w) = z^H w$, where $z^H$ is the conjugate transpose of $z$.

We will work with complex inner product spaces when we study trigonometric approximation in §4.5.


THEOREM 4.2
Any real inner product space $V$ is a real normed linear space with norm defined by $\|v\| = (v,v)^{1/2}$.

PROOF Clearly (i), (ii), and (iii) of Definition 3.13 (the definition of norm, on page 88) are satisfied, while (iv) follows from the Cauchy–Schwarz inequality

$$|(u,v)| \le \|u\|\,\|v\| \quad \forall\, u, v \in V.$$

In particular,

$$\|u + v\|^2 = (u + v, u + v) = (u,u) + 2(u,v) + (v,v) \le \|u\|^2 + 2\|u\|\,\|v\| + \|v\|^2 = (\|u\| + \|v\|)^2.$$


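Example 4.4
(norms on inner product spaces)

(1) $V = C[a,b]$ is an inner product space with inner product

$$(f,g) = \int_a^b f(x) g(x)\,dx$$

and a normed linear space with the norm

$$\|f\| = \left( \int_a^b f^2(x)\,dx \right)^{1/2}.$$

The Cauchy–Schwarz inequality for this inner product has the form

$$\left| \int_a^b f(x) g(x)\,dx \right| \le \left( \int_a^b f^2(x)\,dx \right)^{1/2} \left( \int_a^b g^2(x)\,dx \right)^{1/2}.$$

(2) $V = \mathbb{R}^n$ is an inner product space with inner product

$$(x,y) = \sum_{i=1}^{n} x_i y_i$$

and a normed space with norm

$$\|x\| = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}.$$

The Cauchy–Schwarz inequality for this inner product is

$$\left| \sum_{i=1}^{n} x_i y_i \right| \le \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} \left( \sum_{i=1}^{n} y_i^2 \right)^{1/2}.$$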


REMARK 4.2 The concept of a Cauchy sequence (Definition 2.3 on page 40) can be generalized to normed linear spaces. A sequence of vectors $u_1, u_2, \ldots$ in a normed linear space is said to be a Cauchy sequence if, given any $\epsilon > 0$, there is an integer $N = N(\epsilon)$ such that $\|u_n - u_m\| < \epsilon$ for all $m, n \ge N$. It is easy to show that every convergent sequence is a Cauchy sequence. A Cauchy sequence, however, may not be convergent to an element of the space. If every Cauchy sequence in $V$ is convergent, we say that $V$ is complete.² A complete normed linear space is called a Banach space. A complete inner product space is called a Hilbert space.


Example 4.5 
(of Hilbert and Banach spaces) 


(1) Let $\ell^2$ = set of all real sequences $u = \{u_1, u_2, \ldots\} = \{u_i\}_{i=1}^{\infty}$ that satisfy $\sum_{i=1}^{\infty} |u_i|^2 < \infty$. Define

$$(u,v) = \sum_{j=1}^{\infty} u_j v_j.$$

Then $\ell^2$ is an inner product space, and $\ell^2$ can be shown to be complete. Thus, $\ell^2$ is a Hilbert space.

(2) $C[a,b]$ with norm $\|\cdot\|_\infty$ is complete and is thus a Banach space.


² Basically, this means that the space contains all of its limit points.
4.2.2 Best Approximations in Inner Product Spaces


Consider now the problem of finding the best approximation in an inner product space. Specifically, we wish to find $w \in W \subset V$ that is closest to a given $v \in V$, where $V$ is an inner product space and $W$ is finite-dimensional. Let $W = \mathrm{span}(w_1, w_2, \ldots, w_n) \subset V$, where $\{w_j\}_{j=1}^{n}$ is a linearly independent set. We wish to find $w \in W$ such that $\|w - v\|^2 \le \|u - v\|^2$ for all $u \in W$. But

$$\begin{aligned}
\|w - v\|^2 &= (w - v, w - v) \\
&= \Big( \sum_{j=1}^{n} \alpha_j w_j - v,\ \sum_{k=1}^{n} \alpha_k w_k - v \Big) \\
&= \sum_{j} \sum_{k} \alpha_j \alpha_k (w_j, w_k) - \sum_{j} \alpha_j (v, w_j) - \sum_{k} \alpha_k (v, w_k) + (v,v) \\
&= F(\alpha_1, \alpha_2, \ldots, \alpha_n),
\end{aligned}$$

where $w = \sum_{j=1}^{n} \alpha_j w_j$.

Thus, the problem reduces to finding the minimum of $F$ as a function of $\alpha_1, \alpha_2, \ldots, \alpha_n$. Hence, setting $\partial F / \partial \alpha_\ell = 0$ gives

$$2 \sum_{j=1}^{n} \alpha_j (w_j, w_\ell) - 2 (v, w_\ell) = 0,$$

or

$$\sum_{j=1}^{n} \alpha_j (w_j, w_\ell) = (v, w_\ell) \quad \text{for } \ell = 1, 2, \ldots, n. \tag{4.1}$$

Equations (4.1) represent a linear system that is positive definite and hence invertible, and thus can be solved for the $\alpha_j$. In this case, the best approximation $w$ is called the least-squares approximation. (The matrix corresponding to the system (4.1) is often called the Gram matrix.)
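To make (4.1) concrete, here is a small sketch that sets up and solves the normal equations for one illustrative choice that is not taken from the text: $V = C[0,1]$ with $(f,g) = \int_0^1 f g\,dx$, $W = \mathrm{span}(1, x)$, and $v(x) = e^x$. The Gram matrix entries and right-hand side are entered from the exact integrals $\int_0^1 x^{j+k}\,dx$, $\int_0^1 e^x\,dx = e - 1$, and $\int_0^1 x e^x\,dx = 1$.

```python
import numpy as np

# Gram matrix (w_j, w_l) for w_1 = 1, w_2 = x on [0, 1] (illustrative assumption).
gram = np.array([[1.0, 1.0 / 2.0],
                 [1.0 / 2.0, 1.0 / 3.0]])

# Right-hand side (v, w_l) for v(x) = e^x: integrals of e^x and x*e^x over [0, 1].
rhs = np.array([np.e - 1.0, 1.0])

# The Gram matrix is symmetric positive definite, so the system has a unique solution.
eigenvalues = np.linalg.eigvalsh(gram)
alpha = np.linalg.solve(gram, rhs)

print("eigenvalues of the Gram matrix:", eigenvalues)
print("least-squares line: %.6f + %.6f x" % (alpha[0], alpha[1]))
```

Solving the 2-by-2 system by hand gives $\alpha_1 = 4e - 10$ and $\alpha_2 = 18 - 6e$, which the computed values should match to rounding error.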


REMARK 4.3 Compare this with our explanation in Section 3.3.8.4 on page 139. In particular, in both (3.29) and (4.1), the system of equations is called the normal equations. The functions $\varphi$ in Section 3.3.8.4 correspond to the vectors $w$ here. The dot product in Section 3.3.8.4 is the finite dot product

$$(\varphi^{(i)}, \varphi^{(j)}) = \sum_{k=1}^{m} \varphi^{(i)}(t_k)\,\varphi^{(j)}(t_k).$$

The latter is a dot product only if $\{\varphi^{(i)}\}_{i=1}^{n}$ is "linearly independent on the finite set $\{t_k\}_{k=1}^{m}$."
The following concept is closely related to finding best approximations in Hilbert spaces.


DEFINITION 4.2 Let W be a finite-dimensional subspace of an inner 
product space V. An operator P that maps V into W such that P? = P is 
called a projection operator from V into W. 


REMARK 4.4 Projections are another useful way of defining approximations. For example, $P : V \to W$ can be defined as

$$Pv = \sum_{k=1}^{n} \alpha_k w_k,$$

where the $\alpha_k$'s satisfy

$$\sum_{k=1}^{n} \alpha_k (w_k, w_\ell) = (v, w_\ell) \quad \text{for } \ell = 1, 2, \ldots, n$$

and $\{w_1, \ldots, w_n\}$ is a basis for $W$. In this example, $P$ is a "least squares" projection operator.


We now revisit the concept of orthonormal sets of vectors, which we origi- 
nally introduced on page 92 in conjunction with QR factorizations. 


DEFINITION 4.3 (a restatement of Definition 3.21) Let V be an inner 
product space. Two vectors u and v in V are called orthogonal if (u,v) =0. A 
set of such vectors that are pairwise orthogonal is called orthonormal, provided 
(u,u) = 1 for every vector u in that set. 


Let $w_1, w_2, \ldots, w_m$ be an orthonormal set in $V$, i.e., $(w_i, w_j) = \delta_{ij}$. Let

$$M = \mathrm{span}(w_1, w_2, \ldots, w_m)$$

be the subspace in $V$ spanned by $w_1, w_2, \ldots, w_m$, i.e., given $v \in M$, $v = \sum_{i=1}^{m} c_i w_i$. Define

$$M^\perp = \{ v \in V : (v,w) = 0 \text{ for every } w \in M \},$$

i.e., the elements of $M^\perp$ are orthogonal to those of $M$.

DEFINITION 4.4 $M^\perp$ is called the orthogonal complement of $M$ in $V$.


REMARK 4.5 $M^\perp$ is a subspace of $V$. That is,

(a) $0 \in M^\perp$,
(b) if $v_1, v_2 \in M^\perp$ then $v_1 + v_2 \in M^\perp$,
(c) if $v_1 \in M^\perp$ then $c_1 v_1 \in M^\perp$.


REMARK 4.6 Given $u \in V$, we can associate with $u$ two vectors $Pu$ and $Qu$, with

$$u = Pu + Qu,$$

where

$$Pu = \sum_{k=1}^{m} (u, w_k) w_k \quad \text{and} \quad Qu = u - \sum_{k=1}^{m} (u, w_k) w_k.$$

Clearly, $Pu \in M$ and $Qu \in M^\perp$.


DEFINITION 4.5 The vector $Pu$ is the orthogonal projection of $u$ onto $M$ and $Qu$ is the perpendicular from $u$ onto $M$.


We now have 


PROPOSITION 4.1
(Projections and best approximation) Let $V$ be an inner product space. Given $u \in V$, $\|u - Pu\| \le \|u - h\|$ for any $h \in M$, where $\|\cdot\| = (\cdot,\cdot)^{1/2}$. Thus, the vector in $M$ closest to $u \in V$ is $Pu$, i.e., $Pu$ is the best approximation in $M$ to $u \in V$ with respect to norm $\|\cdot\|$.

PROOF Let $a = h - Pu$, $a \in M$. Then $h = a + Pu$. Thus,

$$\|h - u\|^2 = \|a + Pu - u\|^2 = \|a + c\|^2,$$

where $c = Pu - u$. Continuing,

$$\|h - u\|^2 = (a + c, a + c) = (a,a) + (c,c),$$

since $c = -Qu \in M^\perp$ and $a \in M$, so $(a,c) = 0$. Therefore,

$$\|h - u\|^2 = \|a\|^2 + \|Pu - u\|^2,$$

so $\|Pu - u\| \le \|h - u\|$ for any $h \in M$.


REMARK 4.7 Notice how easy it is to find a best approximation in an 
inner product space from a finite-dimensional subspace with an orthonormal 
basis. 
REMARK 4.8 Consider this proposition geometrically for $V = \mathbb{R}^3$ and

$$M = \mathrm{span}(w_1, w_2) = \mathrm{span}\big( (1,0,0)^T, (0,1,0)^T \big).$$

(See Figure 4.2.) Notice that $M$ is the $xy$-plane. The vector in $M$ closest to $u$ is

$$Pu = \sum_{k=1}^{2} (u, w_k) w_k = u_1 w_1 + u_2 w_2,$$

and $Qu = u_3 (0,0,1)^T$.

FIGURE 4.2: The projection is the best approximation. See Prop. 4.1. (The figure shows $u = (u_1, u_2, u_3)$, its projection $Pu$ in the $xy$-plane, and the perpendicular $Qu$.)

REMARK 4.9 By Proposition 4.1, $\|u - Pu\| = \|Qu\|$ is the shortest distance from subspace $M$ to $u$ in $V$.
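The geometric picture for $V = \mathbb{R}^3$ can be reproduced in a few lines. This is a sketch only: the vector `u`, the random sample of competitors from $M$, and the helper name `project` are assumptions made for illustration.

```python
import numpy as np

# Rows are the orthonormal basis w_1, w_2 of M (the xy-plane).
w = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
u = np.array([2.0, -1.0, 3.0])   # arbitrary vector chosen for illustration

def project(u, w):
    """Orthogonal projection Pu = sum_k (u, w_k) w_k onto the span of the rows of w."""
    coefficients = w @ u          # the inner products (u, w_k)
    return coefficients @ w

Pu = project(u, w)
Qu = u - Pu

# Compare ||u - Pu|| with the distance from u to many other elements h of M.
rng = np.random.default_rng(0)
others = rng.normal(size=(1000, 2)) @ w
distances = np.linalg.norm(others - u, axis=1)

print("Pu =", Pu, " Qu =", Qu)
print("||Qu|| =", np.linalg.norm(Qu), " smallest sampled distance =", distances.min())
```

No sampled element of $M$ should be closer to $u$ than $Pu$, in line with Proposition 4.1.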


4.2.3 The Gram—Schmidt Process (Formal Treatment) 


Suppose now $M$ has a basis $\{u_1, u_2, \ldots, u_m\}$ that is not orthonormal. Can we find an orthonormal basis $w_1, w_2, \ldots, w_m$? (Recall how easy it is to find a best approximation in $M$ to an element of $V$ if $M$ has an orthonormal basis.) This motivates the well-known Gram–Schmidt orthogonalization process.

In Section 3.3.8 (on page 138), we briefly introduced the Gram—Schmidt 
process in the context of QR-factorizations. Here, we carefully consider the 
process formally. 


THEOREM 4.3
(Gram–Schmidt Process)
Let $u_1, u_2, \ldots, u_m$ be linearly independent vectors (elements) in inner product space $V$. If

$$v_1 = u_1,$$

$$v_j = u_j - \sum_{k=1}^{j-1} \frac{(u_j, v_k)}{(v_k, v_k)}\, v_k \quad \text{for } j = 2, 3, \ldots, m, \text{ and}$$

$$w_j = \frac{v_j}{\|v_j\|} \quad \text{for } j = 1, 2, \ldots, m,$$

then $v_1, v_2, \ldots, v_m$ is an orthogonal system, and $w_1, w_2, \ldots, w_m$ is an orthonormal system. Furthermore,

$$M = \mathrm{span}(u_1, u_2, \ldots, u_m) = \mathrm{span}(v_1, v_2, \ldots, v_m) = \mathrm{span}(w_1, w_2, \ldots, w_m).$$
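The formulas of Theorem 4.3 translate almost line by line into code. The sketch below applies them to vectors in $\mathbb{R}^3$ with the ordinary dot product; the function name `gram_schmidt`, the `inner` parameter, and the test vectors are illustrative assumptions. It implements the classical form stated in the theorem, which is not necessarily the variant one would prefer in floating-point practice.

```python
import numpy as np

def gram_schmidt(U, inner=np.dot):
    """Classical Gram-Schmidt on the rows of U (assumed linearly independent).

    Returns (V, W): rows of V are the orthogonal v_j, rows of W the orthonormal w_j.
    """
    V = []
    for u in U:
        v = u.astype(float)                      # astype returns a copy of u
        for vk in V:
            # subtract the component of u_j along v_k: (u_j, v_k) / (v_k, v_k) * v_k
            v = v - inner(u, vk) / inner(vk, vk) * vk
        V.append(v)
    V = np.array(V)
    W = np.array([v / np.sqrt(inner(v, v)) for v in V])
    return V, W

U = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
V, W = gram_schmidt(U)
print(np.round(W @ W.T, 12))   # Gram matrix of the w_j; should be the identity
```

Passing a different `inner` callable would let the same routine work with another inner product on $\mathbb{R}^n$.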


PROOF (By induction) Suppose that

$$M_j = \mathrm{span}(u_1, u_2, \ldots, u_j) = \mathrm{span}(v_1, v_2, \ldots, v_j) = \mathrm{span}(w_1, w_2, \ldots, w_j),$$

where $v_1, \ldots, v_j$ are orthogonal and $w_1, \ldots, w_j$ are orthonormal. Now, for the induction step, assume that

$$v_{j+1} = u_{j+1} - \sum_{k=1}^{j} \frac{(u_{j+1}, v_k)}{(v_k, v_k)}\, v_k. \tag{4.2}$$

We have $v_{j+1} \notin M_j$ and $v_{j+1} \ne 0$ since $u_{j+1}$ is independent of $u_1, u_2, \ldots, u_j$. Also, $(v_{j+1}, v_\ell) = 0$ for $\ell = 1, 2, \ldots, j$ (by simply plugging into (4.2)) and $(w_{j+1}, w_{j+1}) = 1$ since $w_{j+1} = v_{j+1} / \|v_{j+1}\|$. Thus, $v_1, \ldots, v_{j+1}$ are orthogonal and $w_1, w_2, \ldots, w_{j+1}$ are orthonormal. By construction,

$$\mathrm{span}(w_1, w_2, \ldots, w_{j+1}) = \mathrm{span}(v_1, v_2, \ldots, v_{j+1}) \subset \mathrm{span}(u_1, u_2, \ldots, u_{j+1}).$$

(Notice that by the induction hypothesis, each $v_k$ is a linear combination of $u_1, u_2, \ldots, u_j$ for $1 \le k \le j$.) Furthermore, we have

$$u_{j+1} = v_{j+1} + \sum_{k=1}^{j} \frac{(u_{j+1}, v_k)}{(v_k, v_k)}\, v_k.$$

Hence, $u_{j+1} \in \mathrm{span}(v_1, v_2, \ldots, v_{j+1})$. Therefore,

$$M_{j+1} = \mathrm{span}(u_1, u_2, \ldots, u_{j+1}) = \mathrm{span}(v_1, v_2, \ldots, v_{j+1}) = \mathrm{span}(w_1, w_2, \ldots, w_{j+1})$$

for $j = 0, 1, \ldots, m-1$.


Example 4.6 
(Two important cases) 
(1) Legendre Polynomials
Let $V = C[-1,1]$, $M = \mathrm{span}(1, x, x^2)$,

$$(f,g) = \int_{-1}^{1} f(x) g(x)\,dx \quad \text{for } f, g \in V, \text{ and } \|f\| = (f,f)^{1/2}.$$

Notice that $p \in M$ has the form $p(x) = a + bx + cx^2$. Using the Gram–Schmidt process,

$$v_1 = 1, \qquad w_1 = \frac{1}{\left( \int_{-1}^{1} 1\,dx \right)^{1/2}} = \frac{1}{\sqrt{2}},$$

$$v_2 = x - \frac{1}{2} \int_{-1}^{1} x\,dx = x, \qquad w_2 = \frac{v_2}{\|v_2\|} = \frac{x}{\sqrt{2/3}},$$

$$v_3 = x^2 - \frac{1}{2} \int_{-1}^{1} x^2\,dx - \frac{x}{2/3} \int_{-1}^{1} x^3\,dx = x^2 - 1/3, \qquad w_3 = \frac{x^2 - 1/3}{\|x^2 - 1/3\|} = \frac{x^2 - 1/3}{\sqrt{8/45}}.$$

Thus,

$$M = \mathrm{span}\left( \frac{1}{\sqrt{2}},\ \frac{x}{\sqrt{2/3}},\ \frac{x^2 - 1/3}{\sqrt{8/45}} \right).$$

(Note: $w_1$, $w_2$, and $w_3$ are the first three well-known Legendre polynomials. The entire sequence of Legendre polynomials is formed similarly.)

Now, we will find the best approximation to $f(x) = e^x \in V$ in $M$ relative to the inner product above. We need

$$Pf = \sum_{k=1}^{3} (f, w_k) w_k.$$

Substituting the values of $w_k$ we have just computed, we find that $(f, w_1) \approx 1.661985$, $(f, w_2) \approx 0.9011169$, and $(f, w_3) \approx 0.226302$. Thus,

$$Pf \approx \frac{1.661985}{\sqrt{2}} + \frac{0.9011169\,x}{\sqrt{2/3}} + 0.226302\,\frac{x^2 - 1/3}{\sqrt{8/45}} \approx 0.996293 + 1.103638x + 0.536722x^2.$$

For comparison, the Taylor series about 0 gives $e^x \approx 1 + x + x^2/2$. See the following table.

   x    |    Pf    | Taylor series |   e^x
  ------+----------+---------------+----------
   -1   | 0.429377 |     0.500     | 0.367879
  -1/2  | 0.578655 |     0.625     | 0.606531
    0   | 0.996293 |     1.000     | 1.000000
   1/2  | 1.682293 |     1.625     | 1.648721
    1   | 2.636654 |     2.500     | 2.718282
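The computation in part (1) can be repeated numerically. The sketch below runs Gram–Schmidt on $1, x, x^2$ using exact polynomial integration and then approximates the inner products $(f, w_k)$ for $f(x) = e^x$ with Gauss–Legendre quadrature. The helper name `inner_poly` and the choice of 20 quadrature nodes are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.legendre import leggauss

def inner_poly(p, q):
    """Exact integral of p*q over [-1, 1] for two polynomials."""
    antiderivative = (p * q).integ()
    return antiderivative(1.0) - antiderivative(-1.0)

# Gram-Schmidt on the monomials 1, x, x^2.
monomials = [Polynomial([1.0]), Polynomial([0.0, 1.0]), Polynomial([0.0, 0.0, 1.0])]
vs = []
for u in monomials:
    v = u
    for vk in vs:
        v = v - inner_poly(u, vk) / inner_poly(vk, vk) * vk
    vs.append(v)
ws = [v / np.sqrt(inner_poly(v, v)) for v in vs]

# Inner products (f, w_k) for f = e^x, approximated by Gauss-Legendre quadrature.
nodes, weights = leggauss(20)
coeffs = [np.sum(weights * np.exp(nodes) * w(nodes)) for w in ws]
Pf = sum(c * w for c, w in zip(coeffs, ws))

print("v_3 coefficients:", vs[2].coef)   # expect approximately [-1/3, 0, 1]
print("Pf coefficients: ", Pf.coef)
for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(x, Pf(x), 1.0 + x + 0.5 * x * x, np.exp(x))
```

The printed rows correspond to the columns of the table: $Pf$, the quadratic Taylor polynomial, and $e^x$.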


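(2) Fourier Series
Let $V = C[-\pi, \pi]$ with inner product

$$(f,g) = \int_{-\pi}^{\pi} f(x) g(x)\,dx \quad \text{and} \quad \|f\| = (f,f)^{1/2}.$$

Since, for $\ell, m \ge 1$,

$$\frac{1}{\pi} \int_{-\pi}^{\pi} \cos \ell x \cos m x\,dx = \delta_{\ell m}, \qquad \frac{1}{\pi} \int_{-\pi}^{\pi} \sin \ell x \sin m x\,dx = \delta_{\ell m},$$

$$\frac{1}{\pi} \int_{-\pi}^{\pi} \cos \ell x \sin m x\,dx = 0,$$

it is readily seen that

$$\frac{1}{\sqrt{2\pi}},\ \frac{\cos x}{\sqrt{\pi}},\ \frac{\sin x}{\sqrt{\pi}},\ \ldots,\ \frac{\cos nx}{\sqrt{\pi}},\ \frac{\sin nx}{\sqrt{\pi}}$$

are orthonormal. Let $M = \mathrm{span}\left( \frac{1}{\sqrt{2\pi}}, \frac{\cos x}{\sqrt{\pi}}, \frac{\sin x}{\sqrt{\pi}}, \ldots, \frac{\cos nx}{\sqrt{\pi}}, \frac{\sin nx}{\sqrt{\pi}} \right)$ and let $f \in V$. Then the best approximation in $M$ to $f$ is

$$Pf = a_0 \frac{1}{\sqrt{2\pi}} + \sum_{\ell=1}^{n} \left( a_\ell \frac{\cos \ell x}{\sqrt{\pi}} + b_\ell \frac{\sin \ell x}{\sqrt{\pi}} \right),$$

where

$$a_0 = \left( f, \frac{1}{\sqrt{2\pi}} \right), \quad a_\ell = \left( f, \frac{\cos \ell x}{\sqrt{\pi}} \right), \quad \text{and} \quad b_\ell = \left( f, \frac{\sin \ell x}{\sqrt{\pi}} \right)$$

for $\ell = 1, 2, \ldots, n$.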
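As a rough numerical illustration of part (2), the sketch below checks that the scaled trigonometric functions are orthonormal and computes the projection coefficients $(f, w_k)$ for the sample function $f(x) = x$. The sample function, the value $n = 3$, and the use of a 64-point Gauss–Legendre rule mapped to $[-\pi, \pi]$ are assumptions made for illustration only.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

n = 3
nodes, weights = leggauss(64)
x = np.pi * nodes        # quadrature nodes mapped from [-1, 1] to [-pi, pi]
wts = np.pi * weights    # weights scaled by the same change of variables

def inner(fv, gv):
    """Quadrature approximation of the integral of f*g over [-pi, pi]."""
    return np.sum(wts * fv * gv)

# Samples of the basis: 1/sqrt(2 pi), cos(lx)/sqrt(pi), sin(lx)/sqrt(pi), l = 1..n.
basis = [np.full_like(x, 1.0 / np.sqrt(2.0 * np.pi))]
for l in range(1, n + 1):
    basis.append(np.cos(l * x) / np.sqrt(np.pi))
    basis.append(np.sin(l * x) / np.sqrt(np.pi))
basis = np.array(basis)

gram = np.array([[inner(bi, bj) for bj in basis] for bi in basis])
f = x.copy()                                   # samples of f(x) = x
coeffs = np.array([inner(f, b) for b in basis])

print(np.round(gram, 10))    # should be close to the identity matrix
print(np.round(coeffs, 6))   # only the sine coefficients are nonzero for an odd f
```

For $f(x) = x$ the exact sine coefficients are $2\sqrt{\pi}\,(-1)^{\ell+1}/\ell$, which gives a quick consistency check on the quadrature.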


Before continuing, it is worthwhile to summarize our study so far: If $W \subset V$ is a finite-dimensional subspace of a vector space $V$ consisting of "nice" (easy to work with) elements, a fundamental problem in approximation theory is

"given $v \in V$, find $w \in W$ close to $v$."

We saw the following.

(a) If $V$ is a normed linear space, by Theorem 4.1, there exists a $w \in W$ such that $\|w - v\| \le \|u - v\|$ for all $u \in W$. (However, as we will see, $w$ may not be easy to find.)


(b) If V is an inner product space, then the best approximation is easy to 
find, especially if an orthonormal basis is applied. 


In the remainder of this chapter, we consider specific vector spaces V, such 
as C[0,1], along with various choices for spaces W of nice functions such as 
polynomials, trigonometric functions, and continuous piecewise polynomials. 
Furthermore, we consider approximations in different norms or normed linear 
spaces. 


4.3. Polynomial Approximation 


The simplest choice for W is a set of polynomials. Polynomials are easy to 
work with and understand. Furthermore, the following Weierstrass Approxi- 
mation Theorem tells us that polynomials can provide good approximations. 


4.3.1 The Weierstrass Approximation Theorem 


The Weierstrass approximation theorem is 


THEOREM 4.4
Given $f \in C[a,b]$ and $\epsilon > 0$, there exists a polynomial $p(x)$ such that

$$\|p - f\|_\infty = \max_{a \le x \le b} |f(x) - p(x)| < \epsilon.$$


REMARK 4.10 This result is somewhat surprising, because $f$ is only required to be continuous, and polynomials are smooth. For example, the graph of $f(x)$ may be as in Figure 4.3, but a polynomial $p(x)$ can be found such that $\|f - p\|_\infty < \epsilon$, no matter how small we choose $\epsilon$. Basically, we can make the change of direction of a polynomial's graph be arbitrarily abrupt by taking the degree of the polynomial to be sufficiently high. However, as we will see, the theorem is not practical for actually finding $p(x)$.


PROOF Let $x = (b-a)t + a$ for $t \in [0,1]$. Then

$$h(t) = f\big( (b-a)t + a \big)$$

is continuous on $[0,1]$. We can therefore restrict our attention to $h \in C[0,1]$. To see this, observe that if $q$ is a polynomial and $p(x) = q\big( \tfrac{x-a}{b-a} \big)$, then $p$ is a polynomial and

$$\max_{a \le x \le b} |f(x) - p(x)| = \max_{0 \le t \le 1} |h(t) - q(t)|.$$

FIGURE 4.3: Graph of a nonsmooth but continuous $f(x)$; see Remark 4.10.

Define $B_m(h;t)$, $m = 1, 2, \ldots$, by

$$B_m(h;t) = \sum_{k=0}^{m} h\!\left( \frac{k}{m} \right) \binom{m}{k} t^k (1-t)^{m-k}. \tag{4.3}$$

Clearly, $B_m(h;t) \in P_m$, where $P_m$ is the set of polynomials of degree $m$ or less. ($B_m(h;t)$ is called a Bernstein polynomial of degree $m$.) Furthermore, $B_m(h;t)$ can be regarded as a linear operator on $C[0,1]$, since

(a) $B_m(\lambda h; t) = \lambda B_m(h;t)$,

(b) $B_m(h_1 + h_2; t) = B_m(h_1;t) + B_m(h_2;t)$.

In addition,

(c) If $h_1(t) \le h_2(t)$ for all $t \in [0,1]$, then $B_m(h_1;t) \le B_m(h_2;t)$ for all $t \in [0,1]$.

We now show that, given $h \in C[0,1]$ and $\epsilon > 0$, there exists an integer $m_0 > 0$ such that

$$\max_{0 \le t \le 1} |h(t) - B_{m_0}(h;t)| < \epsilon.$$

Hence, the theorem is proven by selecting

$$p(x) = B_{m_0}\!\left( h; \frac{x-a}{b-a} \right).$$


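Before proving this, let's look at some special cases. (We also need these in the proof.) The special cases are

1. $h(t) = 1$,
2. $h(t) = t$, and
3. $h(t) = t^2$.

In detail,

1. $$B_m(1;t) = \sum_{k=0}^{m} \binom{m}{k} t^k (1-t)^{m-k} = \big( t + (1-t) \big)^m = 1.$$

2. $$\begin{aligned}
B_m(t;t) &= \sum_{k=0}^{m} \frac{k}{m} \binom{m}{k} t^k (1-t)^{m-k} = \sum_{k=1}^{m} \binom{m-1}{k-1} t^k (1-t)^{m-k} \\
&= t \sum_{j=0}^{m-1} \binom{m-1}{j} t^j (1-t)^{m-1-j} = t \big( t + (1-t) \big)^{m-1} = t
\end{aligned}$$

for $m \ge 1$. Thus, $B_m(t;t) = t$ for $m \ge 1$.

3. $$\begin{aligned}
B_m(t^2;t) &= B_m(t^2 - t/m;\,t) + \frac{1}{m} B_m(t;t) \quad \text{(using linearity of } B_m(\cdot\,;t)) \\
&= \sum_{k=0}^{m} \left( \frac{k^2}{m^2} - \frac{k}{m^2} \right) \binom{m}{k} t^k (1-t)^{m-k} + t/m \\
&= \left( 1 - \frac{1}{m} \right) \sum_{k=2}^{m} \binom{m-2}{k-2} t^k (1-t)^{m-k} + t/m \\
&= \left( 1 - \frac{1}{m} \right) t^2 \sum_{j=0}^{m-2} \binom{m-2}{j} t^j (1-t)^{m-2-j} + t/m \\
&= \left( 1 - \frac{1}{m} \right) t^2 + t/m \\
&= t^2 + \frac{t(1-t)}{m}, \quad \text{for } m \ge 2.
\end{aligned}$$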
Thus,

$$B_m(t^2;t) = t^2 + \frac{t(1-t)}{m} \to t^2 \quad \text{as } m \to \infty$$

and

$$\|B_m(t^2;t) - t^2\|_\infty = \max_{0 \le t \le 1} \left| \frac{t(1-t)}{m} \right| = \frac{1/4}{m}.$$

Thus, $\|B_m(t^2;t) - t^2\|_\infty < 10^{-7}$ requires $m > 2{,}500{,}000$.

Conclusion 4.1 Bernstein polynomials are often not practical,³ but are useful theoretically.
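The slow $1/(4m)$ convergence is easy to observe numerically. The following sketch evaluates (4.3) directly; the function name `bernstein`, the evaluation grid, and the sampled values of $m$ are illustrative assumptions, and the direct evaluation is only intended for moderate $m$.

```python
import numpy as np
from math import comb

def bernstein(h, m, t):
    """Evaluate B_m(h; t) from (4.3) at each point of the array t in [0, 1]."""
    t = np.asarray(t, dtype=float)
    k = np.arange(m + 1)
    binom = np.array([comb(m, int(j)) for j in k], dtype=float)
    # basis[i, k] = C(m, k) * t_i^k * (1 - t_i)^(m - k)
    basis = binom * t[:, None] ** k * (1.0 - t[:, None]) ** (m - k)
    return basis @ h(k / m)

t = np.linspace(0.0, 1.0, 101)
square = lambda s: s ** 2

for m in (10, 50, 200):
    error = np.max(np.abs(bernstein(square, m, t) - t ** 2))
    print(m, error, 1.0 / (4.0 * m))   # observed max error versus 1/(4m)
```

Each twenty-fold increase in $m$ only buys a twenty-fold reduction in the error, which is the behavior Conclusion 4.1 refers to.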


Now let us continue with the proof. 


PROOF (continuation; we need to show that $\|B_m(h;t) - h\|_\infty < \epsilon$ for $h \in C[0,1]$ for $m$ sufficiently large.) Recall that $h \in C[0,1]$. Suppose $\max_{0 \le t \le 1} |h(t)| = M$. Then

$$-2M \le h(t) - h(s) \le 2M \quad \text{for all } t, s \in [0,1]. \tag{4.4}$$

Also, since $h(t)$ is continuous on $[0,1]$, given $\epsilon_1 > 0$ there is a $\delta = \delta(\epsilon_1) > 0$ that does not depend on $s$ or $t$ ($h$ is uniformly continuous) such that

$$|t - s| < \delta \quad \text{implies that} \quad -\epsilon_1 < h(t) - h(s) < \epsilon_1. \tag{4.5}$$

By (4.4) and (4.5), for any $s, t \in [0,1]$,

$$-\epsilon_1 - \frac{2M}{\delta^2}(t-s)^2 \le h(t) - h(s) \le \epsilon_1 + \frac{2M}{\delta^2}(t-s)^2. \tag{4.6}$$

To see this, first suppose that $|t - s| < \delta$; then, (4.5) implies (4.6). Now suppose that $|t - s| \ge \delta$ so that $(t-s)^2/\delta^2 \ge 1$; then (4.4) implies (4.6). For the moment, fix $s \in [0,1]$. Then (4.6) has the form $h_1(t) \le h_2(t) \le h_3(t)$, where

$$h_1(t) = -\epsilon_1 - \frac{2M}{\delta^2}(t-s)^2, \quad h_2(t) = h(t) - h(s), \quad \text{and} \quad h_3(t) = \epsilon_1 + \frac{2M}{\delta^2}(t-s)^2.$$

By linearity and monotonicity of $B_m(h;t)$ (see (a), (b), and (c) on page 206), we conclude that $B_m(h_1;t) \le B_m(h_2;t) \le B_m(h_3;t)$, so

$$-\epsilon_1 - \frac{2M}{\delta^2} B_m\big( (t-s)^2; t \big) \le B_m(h;t) - h(s) \le \epsilon_1 + \frac{2M}{\delta^2} B_m\big( (t-s)^2; t \big).$$


³ With some exceptions. For example, in computer graphics rendering, high accuracy is not required to get a smooth picture, but the edges of the picture need to be smooth, and Bernstein approximations are extremely smooth.
But

$$B_m\big( (t-s)^2; t \big) = B_m(t^2;t) - 2s B_m(t;t) + s^2 B_m(1;t) = t^2 + \frac{t(1-t)}{m} - 2st + s^2 = (t-s)^2 + \frac{t(1-t)}{m}.$$

Thus,

$$|B_m(h;t) - h(s)| \le \epsilon_1 + \frac{2M}{\delta^2} \left[ (t-s)^2 + \frac{t(1-t)}{m} \right].$$

Letting $t = s$, we have

$$|B_m(h;s) - h(s)| \le \epsilon_1 + \frac{s(1-s)}{m} \frac{2M}{\delta^2} \le \epsilon_1 + \frac{M}{2m\delta^2} \quad \text{for all } s \in [0,1].$$

Choosing $\epsilon_1 = \epsilon/2$ and $m > M/(\delta^2 \epsilon)$, we have $|B_m(h;s) - h(s)| < \epsilon$ for all $s \in [0,1]$, where $\epsilon$ is arbitrarily small.


The Weierstrass Approximation Theorem tells us theoretically that polyno- 
mials can be good approximations to continuous functions. In the remainder 
of this section, we consider practical methods for obtaining highly accurate 
polynomial approximations. 


4.3.2 Taylor Polynomial Approximations 


Recall from Chapter 1 (Taylor's Theorem, on page 2) that if $f \in C^n[a,b]$ and $f^{(n+1)}(x)$ exists on $[a,b]$, then for $x_0 \in [a,b]$ there exists a $\xi(x)$ between $x_0$ and $x$ such that $f(x) = P_n(x) + R_n(x)$, where

$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k,$$

and

$$R_n(x) = \int_{x_0}^{x} \frac{(x-t)^n}{n!} f^{(n+1)}(t)\,dt = \frac{f^{(n+1)}(\xi(x)) (x - x_0)^{n+1}}{(n+1)!}.$$

$P_n(x)$ is the Taylor polynomial of $f(x)$ about $x = x_0$ and $R_n(x)$ is the remainder term. Taylor polynomials provide good approximations near $x = x_0$. However, away from $x = x_0$, Taylor polynomials can be poor approximations. In addition, Taylor series require smooth functions. Nonetheless, automatic differentiation techniques, as explained in Section 6.2 on page 327, can be used to obtain high-order derivatives for complicated but smooth functions.


4.3.3 Lagrange Interpolation 


Problem: Given $n+1$ distinct real numbers $x_0, x_1, x_2, \ldots, x_n$ and $n+1$ arbitrary numbers $y_0, y_1, \ldots, y_n$, find the polynomial $p$ of degree at most $n$ such that $y_j = p(x_j)$ for $j = 0, 1, 2, \ldots, n$.

We will address this problem in this section.


THEOREM 4.5
For any $n+1$ distinct real numbers $x_0, x_1, \ldots, x_n$ and for arbitrary real numbers $y_0, y_1, \ldots, y_n$, there exists a unique interpolating polynomial $p$ of degree at most $n$ such that $p(x_j) = y_j$, $j = 0, 1, \ldots, n$.


PROOF We first define a useful set of polynomials of degree $n$ denoted by $\ell_0, \ell_1, \ldots, \ell_n$ for points $x_0, x_1, \ldots, x_n \in \mathbb{R}$ as

$$\ell_k(x) = \prod_{\substack{i=0 \\ i \ne k}}^{n} \frac{x - x_i}{x_k - x_i}, \quad k = 0, 1, \ldots, n. \tag{4.7}$$

Notice that

(i) $\ell_k(x)$ is of degree $n$ for each $k = 0, 1, \ldots, n$.

(ii) $\ell_k(x_j) = \begin{cases} 0 & \text{if } j \ne k, \\ 1 & \text{if } j = k, \end{cases}$ that is, $\ell_k(x_j) = \delta_{kj}$.

Now let

$$p(x) = \sum_{k=0}^{n} y_k \ell_k(x).$$

Then

$$p(x_j) = \sum_{k=0}^{n} y_k \ell_k(x_j) = \sum_{k=0}^{n} y_k \delta_{jk} = y_j \quad \text{for } j = 0, 1, \ldots, n.$$

Thus,

$$p(x) = \sum_{k=0}^{n} y_k \ell_k(x)$$

is a polynomial of degree at most $n$ that passes through points $(x_j, y_j)$, $j = 0, 1, 2, \ldots, n$. This is called the Lagrange form of the interpolating polynomial, and the set of functions $\{\ell_k\}_{k=0}^{n}$ is called the Lagrange basis for the space of polynomials of degree $n$ associated with the set of points $\{x_i\}_{i=0}^{n}$.

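We now show that the polynomial is unique. Suppose that $q(x) \ne p(x)$, $q(x_j) = p(x_j) = y_j$, $j = 0, 1, 2, \ldots, n$, and $q$ and $p$ are polynomials of degree at most $n$. Let $p(x) = \sum_{k=0}^{n} \alpha_k x^k$ and $q(x) = \sum_{k=0}^{n} \beta_k x^k$. Then

$$p(x) - q(x) = \sum_{k=0}^{n} (\alpha_k - \beta_k) x^k = \sum_{k=0}^{n} \gamma_k x^k.$$

(We will show that $\gamma_k = 0$ for $k = 0, 1, 2, \ldots, n$ and thus, $p(x) = q(x)$.) Since $p(x_j) = q(x_j)$,

$$\sum_{k=0}^{n} \gamma_k x_j^k = 0$$

for $j = 0, 1, 2, \ldots, n$. This can be expressed as the linear system

$$A\gamma = \begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix} \begin{pmatrix} \gamma_0 \\ \gamma_1 \\ \vdots \\ \gamma_n \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

The above matrix $A$ is called the Vandermonde matrix. As shown in Remark 4.11 below,

$$\det(A) = \prod_{k=1}^{n} \prod_{j=0}^{k-1} (x_k - x_j) \ne 0,$$

since $x_k \ne x_j$ for $j \ne k$. Hence, $A$ is nonsingular, so $\gamma_0 = \gamma_1 = \cdots = \gamma_n = 0$.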
REMARK 4.11 Consider

$$V_n(x) = \det \begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n-1} & x_{n-1}^2 & \cdots & x_{n-1}^n \\ 1 & x & x^2 & \cdots & x^n \end{pmatrix} = a_0 + a_1 x + \cdots + a_n x^n$$

(a polynomial of degree $n$). But

$$V_n(x) = a_n (x - x_0)(x - x_1) \cdots (x - x_{n-1}),$$

since the zeros of $V_n(x)$ are $x_0, \ldots, x_{n-1}$. (To see this, note that two rows of the determinant are identical when $x$ is replaced by $x_j$.) Furthermore, observing that $a_n = V_{n-1}(x_{n-1})$ shows us that

$$V_n(x) = V_{n-1}(x_{n-1})(x - x_0) \cdots (x - x_{n-1}) = V_{n-1}(x_{n-1}) \prod_{j=0}^{n-1} (x - x_j).$$

(There are various ways of seeing this, involving subtracting multiples of one row from another, factoring out common factors from rows or columns, and expansion by minors.) Therefore, by observing $V_1(x_1) = (x_1 - x_0)$ and using an induction argument (or else repeating the above process with $V_{n-1}(x)$ in place of $V_n(x)$), we obtain

$$\det(A) = V_n(x_n) = \prod_{k=1}^{n} \prod_{j=0}^{k-1} (x_k - x_j),$$

where $A$ is the above Vandermonde matrix.


Summarizing, we obtain the Lagrange form of the (unique) interpolating polynomial:

$$p(x) = \sum_{k=0}^{n} y_k \ell_k(x) \quad \text{with} \quad \ell_k(x) = \prod_{\substack{j=0 \\ j \ne k}}^{n} \frac{x - x_j}{x_k - x_j}, \tag{4.8}$$

where $\{\ell_k\}_{k=0}^{n}$ is the Lagrange basis for the space of polynomials of degree $n$ associated with the set of points $\{x_i\}_{i=0}^{n}$, and we will call the $\ell_k$ the Lagrange basis functions.
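Formula (4.8) can be evaluated directly, as in the following sketch. The function names `lagrange_basis` and `lagrange_interpolant`, the nodes, and the sampled cubic are illustrative assumptions; direct evaluation of (4.8) is shown for clarity rather than efficiency.

```python
import numpy as np

def lagrange_basis(nodes, k, x):
    """Evaluate the Lagrange basis function l_k(x) of (4.8) for distinct nodes."""
    x = np.asarray(x, dtype=float)
    value = np.ones_like(x)
    for j, xj in enumerate(nodes):
        if j != k:
            value *= (x - xj) / (nodes[k] - xj)
    return value

def lagrange_interpolant(nodes, values, x):
    """Evaluate p(x) = sum_k y_k l_k(x)."""
    return sum(y * lagrange_basis(nodes, k, x) for k, y in enumerate(values))

nodes = np.array([0.0, 1.0, 2.0, 4.0])     # arbitrary distinct nodes
values = nodes ** 3 - 2.0 * nodes          # samples of the cubic x^3 - 2x

# Because the data come from a cubic and n = 3, uniqueness implies p equals that cubic.
print(lagrange_interpolant(nodes, values, 3.0))   # compare with 3**3 - 2*3
print([float(lagrange_basis(nodes, 1, xj)) for xj in nodes])   # l_1 at the nodes
```

The second line of output illustrates the property $\ell_k(x_j) = \delta_{kj}$ for $k = 1$.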


REMARK 4.12 An important feature of the Lagrange basis is that it is collocating. The salient property of a collocating basis is that the matrix of the system of equations to be solved for the coefficients is the identity matrix. That is, the matrix in the system of equations $\{p(x_i) = y_i\}_{i=0}^{n}$ to solve for the $c_k$ in the representation

$$p(x) = \sum_{k=0}^{n} c_k \ell_k(x),$$

namely

$$\begin{pmatrix} \ell_0(x_0) & \ell_1(x_0) & \cdots & \ell_n(x_0) \\ \ell_0(x_1) & \ell_1(x_1) & \cdots & \ell_n(x_1) \\ \vdots & \vdots & & \vdots \\ \ell_0(x_n) & \ell_1(x_n) & \cdots & \ell_n(x_n) \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}, \tag{4.9}$$

is the identity matrix. (Contrast this to the Vandermonde matrix, where we use $x^k$ instead of $\ell_k(x)$. The Vandermonde matrix becomes ill-conditioned for $n$ moderately sized, while the identity matrix is perfectly conditioned; indeed, we need do no work to solve (4.9).)
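The growth of the Vandermonde condition number can be observed with a short experiment. This is a sketch under assumptions: equally spaced nodes on $[0,1]$ and the 2-norm condition number are illustrative choices, and the largest values printed are themselves affected by rounding.

```python
import numpy as np

conds = {}
for n in (5, 10, 15, 20):
    nodes = np.linspace(0.0, 1.0, n + 1)      # equally spaced nodes (assumption)
    A = np.vander(nodes, increasing=True)     # columns are 1, x, x^2, ..., x^n
    conds[n] = np.linalg.cond(A)              # 2-norm condition number
    print(n, conds[n])
```

By contrast, the matrix in (4.9) is the identity, whose condition number is 1 for every $n$.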


Although the interpolating polynomial is unique, there are various ways 
of representing it, just as there are various bases that can be used to repre- 
sent a vector in (n + 1)-dimensional space. In the next section, we consider 
representing the interpolating polynomial in the so-called Newton form. 
4.3.4 The Newton Form for the Interpolating Polynomial


Although the Lagrange polynomials form a collocating basis, useful in the- 
ory for symbolically deriving formulas such as for numerical integration, the 
Lagrange representation (4.8) is generally not used to numerically evaluate 
interpolating polynomials. This is because 


(1) it requires many operations to evaluate p(x) for many different values 
of x, and 


(2) all $\ell_k$'s change if another point $(x_{n+1}, y_{n+1})$ is added.


These problems are alleviated in the Newton form of the interpolating poly- 
nomial. To describe this form we use the following. 


DEFINITION 4.6 $y[x_j, x_{j+1}] = \dfrac{y_{j+1} - y_j}{x_{j+1} - x_j}$ is the first divided difference.

DEFINITION 4.7

$$y[x_j, x_{j+1}, \ldots, x_{j+k}] = \frac{y[x_{j+1}, \ldots, x_{j+k}] - y[x_j, \ldots, x_{j+k-1}]}{x_{j+k} - x_j}$$

is the $k$-th order divided difference (and is defined iteratively).
Consider first the linear interpolant through $(x_0, y_0)$ and $(x_1, y_1)$:

$$p_1(x) = y_0 + (x - x_0)\, y[x_0, x_1],$$

since $p_1(x)$ is of degree 1, $p_1(x_0) = y_0$, and

$$p_1(x_1) = y_0 + (x_1 - x_0) \frac{y_1 - y_0}{x_1 - x_0} = y_1.$$

Consider now the quadratic interpolant through $(x_0, y_0)$, $(x_1, y_1)$, and $(x_2, y_2)$. We have

$$p_2(x) = p_1(x) + (x - x_0)(x - x_1)\, y[x_0, x_1, x_2],$$

since $p_2(x)$ is of degree 2, $p_2(x_0) = y_0$, $p_2(x_1) = y_1$, and

$$\begin{aligned}
p_2(x_2) &= y_0 + (x_2 - x_0) \frac{y_1 - y_0}{x_1 - x_0} + (x_2 - x_0)(x_2 - x_1) \frac{\dfrac{y_2 - y_1}{x_2 - x_1} - \dfrac{y_1 - y_0}{x_1 - x_0}}{x_2 - x_0} \\
&= y_0 + (x_2 - x_0) \frac{y_1 - y_0}{x_1 - x_0} + (y_2 - y_1) - (x_2 - x_1) \frac{y_1 - y_0}{x_1 - x_0} \\
&= y_0 + y_2 - y_1 + \frac{1}{x_1 - x_0} \big[ (x_2 - x_0 - x_2 + x_1) y_1 + (-x_2 + x_0 - x_1 + x_2) y_0 \big] \\
&= y_2.
\end{aligned}$$

Continuing this process, one obtains:

$$\begin{aligned}
p_n(x) &= p_{n-1}(x) + (x - x_0)(x - x_1) \cdots (x - x_{n-1})\, y[x_0, x_1, \ldots, x_n] \\
&= y_0 + (x - x_0)\, y[x_0, x_1] + (x - x_0)(x - x_1)\, y[x_0, x_1, x_2] + \cdots + \prod_{i=0}^{n-1} (x - x_i)\, y[x_0, \ldots, x_n].
\end{aligned}$$

This is called Newton's divided-difference formula for the interpolating polynomial through the points $\{(x_j, y_j)\}_{j=0}^{n}$. This is a computationally efficient form, because the divided differences can be rapidly calculated using the following tabular arrangement, which is easily implemented on a computer.


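  j | x_j | y_j | $y[x_j, x_{j+1}]$ | $y[x_j, x_{j+1}, x_{j+2}]$ | $y[x_j, x_{j+1}, x_{j+2}, x_{j+3}]$
  --+-----+-----+-------------------+----------------------------+------------------------------------
  0 | $x_0$ | $y_0$ | | |
  1 | $x_1$ | $y_1$ | $\frac{y_1 - y_0}{x_1 - x_0} = y[x_0, x_1]$ | |
  2 | $x_2$ | $y_2$ | $\frac{y_2 - y_1}{x_2 - x_1} = y[x_1, x_2]$ | $\frac{y[x_1, x_2] - y[x_0, x_1]}{x_2 - x_0} = y[x_0, x_1, x_2]$ |
  3 | $x_3$ | $y_3$ | $\frac{y_3 - y_2}{x_3 - x_2} = y[x_2, x_3]$ | $\frac{y[x_2, x_3] - y[x_1, x_2]}{x_3 - x_1} = y[x_1, x_2, x_3]$ | $\frac{y[x_1, x_2, x_3] - y[x_0, x_1, x_2]}{x_3 - x_0} = y[x_0, x_1, x_2, x_3]$
  4 | $x_4$ | $y_4$ | $\frac{y_4 - y_3}{x_4 - x_3} = y[x_3, x_4]$ | $\frac{y[x_3, x_4] - y[x_2, x_3]}{x_4 - x_2} = y[x_2, x_3, x_4]$ | $\frac{y[x_2, x_3, x_4] - y[x_1, x_2, x_3]}{x_4 - x_1} = y[x_1, x_2, x_3, x_4]$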


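Example 4.7

Consider $y(x) = \dfrac{1}{\sqrt{2\pi}} \displaystyle\int_{-\infty}^{x} e^{-\frac{1}{2}t^2}\,dt$ (standard normal distribution).

  j | x_j | y_j    | $y[x_j, x_{j+1}]$ | $y[x_j, x_{j+1}, x_{j+2}]$ | $y[x_j, x_{j+1}, x_{j+2}, x_{j+3}]$
  --+-----+--------+-------------------+----------------------------+------------------------------------
  0 | 1.4 | 0.9192 |                   |                            |
  1 | 1.6 | 0.9452 | 0.130             |                            |
  2 | 1.8 | 0.9641 | 0.0945            | -0.08875                   |
  3 | 2.0 | 0.9772 | 0.0655            | -0.0725                    | $\frac{-0.0725 + 0.08875}{0.6} = 0.02708$

$$p_1(x) = 0.9192 + (x - 1.4)\,0.130$$

is the line through $(1.4, 0.9192)$ and $(1.6, 0.9452)$. Hence, $y(1.65) \approx p_1(1.65) = 0.9517$. Also,

$$p_2(x) = 0.9192 + (x - 1.4)\,0.130 + (x - 1.4)(x - 1.6)(-0.08875)$$

is a quadratic polynomial through $(x_0, y_0)$, $(x_1, y_1)$, and $(x_2, y_2)$. Hence, $y(1.65) \approx p_2(1.65) = 0.9506$. Finally,

$$p_3(x) = p_2(x) + (x - 1.4)(x - 1.6)(x - 1.8)(0.027083)$$

is the cubic polynomial through all four points and $y(1.65) \approx p_3(1.65) = 0.9505$, which is accurate to four digits.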
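The numbers in Example 4.7 can be reproduced with a compact divided-difference routine. This is one possible sketch: the function names `divided_differences` and `newton_eval` are hypothetical, the table is overwritten in place rather than stored in full, and evaluation uses nested multiplication.

```python
import numpy as np

def divided_differences(x, y):
    """Return the Newton coefficients y[x_0], y[x_0,x_1], ..., y[x_0,...,x_n]."""
    x = np.asarray(x, dtype=float)
    coef = np.array(y, dtype=float)
    n = len(x)
    for k in range(1, n):
        # After this step, coef[i] holds y[x_{i-k}, ..., x_i] for every i >= k.
        coef[k:] = (coef[k:] - coef[k - 1:-1]) / (x[k:] - x[:n - k])
    return coef

def newton_eval(x_nodes, coef, t, degree=None):
    """Evaluate the Newton form of the given degree (default: full) at t."""
    m = len(coef) - 1 if degree is None else degree
    result = coef[m]
    for k in range(m - 1, -1, -1):
        result = result * (t - x_nodes[k]) + coef[k]
    return result

x_nodes = [1.4, 1.6, 1.8, 2.0]
y_vals = [0.9192, 0.9452, 0.9641, 0.9772]
coef = divided_differences(x_nodes, y_vals)

print(coef)                                   # 0.9192, 0.130, -0.08875, 0.027083...
for degree in (1, 2, 3):
    print(degree, round(newton_eval(x_nodes, coef, 1.65, degree), 4))
```

Adding a fifth data point would only append one new coefficient; the existing ones stay unchanged, which is the practical advantage over the Lagrange form noted in the text.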
REMARK 4.13 If the points are equally spaced, i.e., $x_{j+1} - x_j = \Delta x$ for all $j$, Newton's divided difference formula can be simplified. (See, e.g., [6].) The resulting formula is called Newton's forward difference formula. If the points $x_0, \ldots, x_n$ are reordered to $x_n, x_{n-1}, \ldots, x_0$, the resulting formula is called Newton's backward difference formula.


REMARK 4.14 The functions

$$N_0(x) \equiv 1, \qquad N_k(x) = \prod_{i=0}^{k-1} (x - x_i), \quad k = 1, \ldots, n \tag{4.10}$$

form a basis for the space of polynomials of degree $n$. In contrast to the Lagrange basis, this basis and the conditions $\{p(x_i) = y_i\}_{i=0}^{n}$ do not lead to a linear system of equations in which the matrix is the identity matrix, but do lead to a lower triangular system of equations. You will show this in Exercise 16 in this chapter.


4.3.4.1 An Error Formula for the Interpolating Polynomial 


We now consider the error in approximating a given function $f(x)$ by an interpolating polynomial $p(x)$ that passes through the $n+1$ points $(x_j, f(x_j))$, $j = 0, 1, 2, \ldots, n$.

THEOREM 4.6
If $x_0, x_1, \ldots, x_n$ are $n+1$ distinct points in $[a,b]$ and $f \in C^{n+1}[a,b]$, then for each $x \in [a,b]$, there exists a number $\xi = \xi(x) \in (a,b)$ such that

$$f(x) = p(x) + \frac{f^{(n+1)}(\xi(x)) \prod_{j=0}^{n} (x - x_j)}{(n+1)!}. \tag{4.11}$$

PROOF

Case 1 Suppose that $x = x_k$ for some $k$, $0 \le k \le n$. Then, $f(x_k) = p(x_k)$ and (4.11) is satisfied for $\xi(x_k)$ arbitrary on $(a,b)$.

Case 2 Suppose that $x \ne x_k$ for $k = 0, 1, 2, \ldots, n$. Define as a function of $t$,

$$g(t) = f(t) - p(t) - \big( f(x) - p(x) \big) \prod_{i=0}^{n} \frac{t - x_i}{x - x_i}.$$

Since $f \in C^{n+1}[a,b]$, $p \in C^{\infty}[a,b]$, and $x \ne x_i$ for any $i$, it follows that $g \in C^{n+1}[a,b]$. For $t = x_k$,

$$g(x_k) = f(x_k) - p(x_k) - \big( f(x) - p(x) \big) \prod_{i=0}^{n} \frac{x_k - x_i}{x - x_i} = 0.$$

Moreover,

$$g(x) = f(x) - p(x) - \big( f(x) - p(x) \big) \prod_{i=0}^{n} \frac{x - x_i}{x - x_i} = 0.$$

Thus, $g \in C^{n+1}[a,b]$ vanishes at the $(n+2)$ points $x_0, x_1, \ldots, x_n, x$. If $g(x_0) = 0$ and $g(x_1) = 0$, there is an $x_0^*$, $x_0 < x_0^* < x_1$, such that $g'(x_0^*) = 0$. (This is called Rolle's Theorem, a special case of the Intermediate Value Theorem as stated on page 1.) Thus, $g'(t)$ vanishes $n+1$ times. Continuing this argument, $g^{(n+1)}(t)$ vanishes at least once on $(a,b)$. (This is called the Generalized Rolle's theorem.) Thus, there exists a $\xi = \xi(x)$ in $(a,b)$ for which $g^{(n+1)}(\xi) = 0$. Evaluating $g^{(n+1)}(t)$ at $\xi$ gives

$$0 = g^{(n+1)}(\xi) = f^{(n+1)}(\xi) - p^{(n+1)}(\xi) - \big[ f(x) - p(x) \big] \frac{d^{n+1}}{dt^{n+1}} \left[ \prod_{i=0}^{n} \frac{t - x_i}{x - x_i} \right]_{t = \xi},$$

so

$$0 = f^{(n+1)}(\xi) - \big( f(x) - p(x) \big) \frac{(n+1)!}{\prod_{i=0}^{n} (x - x_i)}.$$

Therefore,

$$f(x) = p(x) + \frac{f^{(n+1)}(\xi(x)) \prod_{i=0}^{n} (x - x_i)}{(n+1)!}.$$
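Formula (4.11) can be sanity-checked numerically. In the sketch below, the test function $f(x) = e^x$ on $[0,1]$, the equally spaced nodes, and $n = 3$ are all illustrative assumptions. Since $f^{(4)}(\xi) = e^{\xi}$ lies between 1 and $e$ on this interval, (4.11) implies that $|f(x) - p(x)|$ lies between $|W(x)|/4!$ and $e\,|W(x)|/4!$, where $W(x) = \prod_j (x - x_j)$.

```python
import numpy as np
from math import factorial

a, b, n = 0.0, 1.0, 3
nodes = np.linspace(a, b, n + 1)

# A degree-n least-squares fit through n+1 distinct points is the interpolant.
coeffs = np.polyfit(nodes, np.exp(nodes), n)

x = np.linspace(a, b, 1001)
error = np.abs(np.exp(x) - np.polyval(coeffs, x))
W = np.abs(np.prod(x[:, None] - nodes[None, :], axis=1))

upper = np.e * W / factorial(n + 1)   # uses max of f^(n+1) on [0, 1]
lower = 1.0 * W / factorial(n + 1)    # uses min of f^(n+1) on [0, 1]

print("max observed error:", error.max())
print("max of upper bound:", upper.max())
```

The observed error should sit between the two envelopes at every sample point, up to rounding.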


In the next section, we will see yet another basis for the set of polynomials of degree $n$ that is useful in analyzing the error (4.11) and reducing it.

4.3.4.2 Optimal Points of Interpolation: Chebyshev Polynomials

Now consider the error estimate (4.11). We have

$$\|f - p\|_\infty = \max_{a \le x \le b} |f(x) - p(x)| \le \frac{1}{(n+1)!} \|f^{(n+1)}\|_\infty \|W\|_\infty, \tag{4.12}$$

where

$$W(x) = \prod_{j=0}^{n} (x - x_j).$$

We see that the error bound depends on the nodes $\{x_j\}_{j=0}^{n}$ through $\|W\|_\infty$. We thus pose the following question: Can we choose $\{x_j\}_{j=0}^{n}$ suitably so that $\|W\|_\infty$ is minimized? We first will consider the interval $[a,b] = [-1,1]$.
THEOREM 4.7
The uniform norm of $W(x) = \prod_{i=0}^{n} (x - x_i)$ is minimized on $[-1,1]$ when

\[ x_i = \cos\left( \frac{2i+1}{n+1} \cdot \frac{\pi}{2} \right), \quad 0 \le i \le n, \]

and the minimum value of the norm is $\|W\|_\infty = 2^{-n}$.


PROOF We make a change of variables on $[-1,1]$, letting $x = \cos\theta$ and
restricting $\theta$ to $[0,\pi]$. Consider the trigonometric identity

\[ \cos((k+1)\theta) = 2\cos(\theta)\cos(k\theta) - \cos((k-1)\theta), \quad k = 0, 1, 2, \ldots, \tag{4.13} \]

and define for $\theta \in [0,\pi]$,

\[ T_k(x) = \cos(k\theta), \quad \theta = \arccos x, \quad k = 0, 1, 2, \ldots. \tag{4.14} \]

In terms of the $T_k(x)$'s, (4.13) gives the recursion relation

\[ T_{k+1}(x) = 2T_1(x)T_k(x) - T_{k-1}(x), \quad k = 1, 2, 3, \ldots. \tag{4.15} \]

Also, by (4.14), $T_0(x) = 1$ and $T_1(x) = x$, so (4.15) gives $T_2(x) = 2x^2 - 1$,
$T_3(x) = 4x^3 - 3x$, $T_4(x) = 8x^4 - 8x^2 + 1$, .... By induction, we can conclude

(a) $T_k(x)$ is a polynomial in $x$ of degree $k$;

(b) the leading term of $T_k(x)$ is $2^{k-1}x^k$.

The polynomials $T_k(x)$ are called Chebyshev (or Tschebycheff) polynomials.^4 By (4.14),
we see for $k \ge 1$, $T_k(x) = 0$ implies that $\cos(k\theta) = 0$ with $0 \le k\theta \le k\pi$.
Hence, $k\theta = (2i-1)\pi/2$, $1 \le i \le k$, i.e.,

\[ \theta = \frac{2i-1}{k} \cdot \frac{\pi}{2}, \quad 1 \le i \le k. \]

Thus, the zeros of the $k$-th Chebyshev polynomial $T_k(x)$ are

\[ x_i = \cos\left( \frac{2i-1}{k} \cdot \frac{\pi}{2} \right), \quad 1 \le i \le k. \tag{4.16} \]

We now consider the polynomial $\widetilde{W}(x) = 2^{-n}T_{n+1}(x)$ on $[-1,1]$. By (4.16),
the zeros of $\widetilde{W}$ are

\[ x_i = \cos\left( \frac{2i+1}{n+1} \cdot \frac{\pi}{2} \right), \quad 0 \le i \le n. \]


4The original spelling uses the Cyrillic alphabet. 
Also, since $\widetilde{W}(x)$ is a monic polynomial of degree $n+1$ with roots $x_i$, $0 \le i \le n$, it
must be equal to $W(x) = \prod_{i=0}^{n} (x - x_i)$, where $W$ is as in the statement of the
theorem. Hence,

\[ W(x) = 2^{-n}T_{n+1}(x). \tag{4.17} \]

Now, the maximum value of $|W(x)|$ on $[-1,1]$ occurs at the $n+2$ points $z_j$
where $T_{n+1}(z_j) = \pm 1$. By (4.14), these points are

\[ z_j = \cos\left( \frac{j\pi}{n+1} \right), \quad j = 0, 1, 2, \ldots, n+1. \]

Therefore, $\|W\|_\infty = 2^{-n}$. Now consider $V(x)$ of the same form as $W(x)$ but
with $\|V\|_\infty < \|W\|_\infty$, i.e.,

\[ V(x) = \prod_{i=0}^{n} (x - \hat{x}_i), \]

with $\hat{x}_i \ne x_i$ for at least one $i$. Notice that $W(z_j) = (-1)^j 2^{-n}$. Thus, as is depicted
graphically in Figure 4.4, $W - V$ must alternate in sign at each of the points
$z_0, z_1, z_2, \ldots, z_{n+1}$, so, by the Intermediate Value Theorem, $W - V$ must have a root in
each of these $n+1$ intervals. However, since both $V$ and $W$ are monic,^5
$W - V$ is a polynomial of degree $n$ or less. Therefore, $W - V \equiv 0$, that
is, $W(x) = V(x)$, thus contradicting $\|V\|_\infty < \|W\|_\infty$. We conclude that $W$
minimizes the maximum deviation from 0 among all monic polynomials of degree
$n+1$.

FIGURE 4.4: The Chebyshev equi-oscillation property (Theorem 4.7).
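The statement of Theorem 4.7 can be checked numerically. The following is a minimal sketch, not part of the original text: it samples $|W(x)|$ on a fine grid for the Chebyshev nodes and for equally spaced nodes. The helper names (`cheb_nodes`, `equi_nodes`, `sup_norm_W`) and the choices $n = 8$ and 20001 sample points are illustrative assumptions.

```python
import math

def cheb_nodes(n):
    """Zeros of T_{n+1}: x_i = cos((2i+1)/(n+1) * pi/2), i = 0..n."""
    return [math.cos((2 * i + 1) / (n + 1) * math.pi / 2) for i in range(n + 1)]

def equi_nodes(n):
    """n+1 equally spaced nodes on [-1, 1]."""
    return [-1 + 2 * i / n for i in range(n + 1)]

def node_poly(x, nodes):
    """W(x) = prod_i (x - x_i)."""
    w = 1.0
    for xi in nodes:
        w *= x - xi
    return w

def sup_norm_W(nodes, m=20001):
    """Approximate ||W||_inf on [-1, 1] by sampling m equally spaced points."""
    return max(abs(node_poly(-1 + 2 * k / (m - 1), nodes)) for k in range(m))

n = 8
w_cheb = sup_norm_W(cheb_nodes(n))   # should be close to 2**(-n)
w_equi = sup_norm_W(equi_nodes(n))   # larger, with the maximum near the endpoints
print(w_cheb, 2.0 ** (-n), w_equi)
```

Because the sampling grid contains the endpoints $\pm 1$, where $|T_{n+1}| = 1$, the sampled maximum for the Chebyshev nodes agrees with $2^{-n}$ up to rounding.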


REMARK 4.15 If the interval of approximation is $[a,b]$ rather than
$[-1,1]$, we use the transformation

\[ x = \frac{a - 2y + b}{a - b} \]


5 That is, since the leading coefficient of each is 1.
to map $[a,b]$ to $[-1,1]$. We now approximate

\[ g(x) = f\left( \tfrac{1}{2}(b-a)x + \tfrac{1}{2}(a+b) \right) \]

for $-1 \le x \le 1$. Thus, the corresponding "optimal" interpolation points in
$[a,b]$ are

\[ y_i = \tfrac{1}{2}\bigl[(b-a)x_i + (a+b)\bigr], \quad 0 \le i \le n, \]

where the $x_i$ are given in Theorem 4.7. In this case, one can show that

\[ \|W\|_\infty = 2^{-2n-1}(b-a)^{n+1} \]

(Exercise 7 on page 285).


REMARK 4.16 By Theorems 4.6 and 4.7, using the zeros of the Chebyshev
polynomial as the interpolating nodes, one obtains

\[ \|f - p\|_\infty \le \frac{1}{(n+1)!} \|f^{(n+1)}\|_\infty (b-a)^{n+1} 2^{-2n-1}. \tag{4.18} \]

Compare this to the general error bound: for any set of distinct nodes,
$\|W\|_\infty \le (b-a)^{n+1}$, so one always has

\[ \|f - p\|_\infty \le \frac{1}{(n+1)!} \|f^{(n+1)}\|_\infty (b-a)^{n+1}. \]


REMARK 4.17 Consider now fixing $[a,b]$ and letting $n \to \infty$. One would
expect $\|f - p_n\|_\infty \to 0$ as $n \to \infty$, where $p_n$ is the interpolating polynomial of
degree $n$. However, this is not true in general. For example, consider Runge's
function:

\[ f(x) = \frac{1}{1+x^2} \in C^{\infty}[-5,5]. \]

If we use equally spaced nodes on $[-5,5]$, it can be shown that $\|p_n - f\|_\infty \to \infty$
as $n \to \infty$. However, if the nodes are the Chebyshev points and $f \in C^2[-1,1]$,
then $\|f - p_n\|_\infty \to 0$ as $n \to \infty$. (In fact, $\|f - p_n\|_\infty = O(1/\sqrt{n})$ as $n \to \infty$.)
Nevertheless, in the general case of $f \in C[-1,1]$, it can be shown that, no matter
how the interpolating points are distributed, there exists a continuous function
$f$ for which $\|p_n - f\|_\infty \to \infty$ as $n \to \infty$.
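The contrast described in Remark 4.17 is easy to observe experimentally. The sketch below is an illustration only: it interpolates Runge's function on $[-5,5]$ with equally spaced nodes and with the nodes of Theorem 4.7 mapped to $[-5,5]$ as in Remark 4.15, and estimates the maximum error on a sampling grid. The function names, the degrees $n = 10, 20$, and the 1001-point grid are assumptions made for the example.

```python
import math

def runge(x):
    return 1.0 / (1.0 + x * x)

def lagrange_eval(x, xs, ys):
    """Evaluate the interpolating polynomial through (xs[i], ys[i]) at x (Lagrange form)."""
    total = 0.0
    for k, xk in enumerate(xs):
        term = ys[k]
        for i, xi in enumerate(xs):
            if i != k:
                term *= (x - xi) / (xk - xi)
        total += term
    return total

def max_error(nodes, a=-5.0, b=5.0, m=1001):
    """Sampled estimate of ||f - p_n||_inf on [a, b]."""
    ys = [runge(t) for t in nodes]
    grid = [a + (b - a) * j / (m - 1) for j in range(m)]
    return max(abs(runge(t) - lagrange_eval(t, nodes, ys)) for t in grid)

def equispaced(n, a=-5.0, b=5.0):
    return [a + (b - a) * i / n for i in range(n + 1)]

def chebyshev(n, a=-5.0, b=5.0):
    # nodes of Theorem 4.7, mapped to [a, b] as in Remark 4.15
    return [0.5 * ((b - a) * math.cos((2 * i + 1) / (n + 1) * math.pi / 2) + (a + b))
            for i in range(n + 1)]

errors = {n: (max_error(equispaced(n)), max_error(chebyshev(n))) for n in (10, 20)}
for n, (e_equi, e_cheb) in errors.items():
    print(n, e_equi, e_cheb)
```

In this experiment the equally spaced error grows when $n$ is doubled, while the Chebyshev error shrinks.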


This observation motivates us to consider other methods for approximating 
functions using polynomials, in particular piecewise polynomial interpolation. 
We will do that later. First, we briefly consider interpolation in which we 
specify not only function values, but both function and derivative values at
points. A simple scheme for doing so is Hermite interpolation.


4.3.5 Hermite Interpolation 


In Hermite interpolation, we find the polynomial $p(x)$ of degree at most
$2n-1$ that interpolates the data^7 $(x_i, f(x_i))$, $(x_i, f'(x_i))$ for $i = 1, 2, \ldots, n$,
where $p(x_i) = f(x_i)$ and $p'(x_i) = f'(x_i)$ for $i = 1, 2, \ldots, n$, and the $x_i$ are
distinct points. We have the following.


THEOREM 4.8 

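If $x_1, x_2, \ldots, x_n \in [a,b]$ are distinct points and $f \in C^{1}[a,b]$, the unique
polynomial of degree at most $2n-1$ that agrees with $f(x)$ and $f'(x)$ at $x_1, x_2,
\ldots, x_n$ is given by

\[ H_n(x) = \sum_{k=1}^{n} f(x_k)\, h_k(x) + \sum_{k=1}^{n} f'(x_k)\, \hat{h}_k(x), \]

where

\[ h_k(x) = \bigl[1 - 2\ell_k'(x_k)(x - x_k)\bigr](\ell_k(x))^2, \quad \text{and} \quad \hat{h}_k(x) = (x - x_k)(\ell_k(x))^2, \]

\[ \ell_k(x) = \frac{\Psi_n(x)}{(x - x_k)\Psi_n'(x_k)} = \prod_{\substack{i=1 \\ i \ne k}}^{n} \frac{x - x_i}{x_k - x_i}, \quad \text{where} \quad \Psi_n(x) = \prod_{i=1}^{n} (x - x_i). \]

Moreover, if $f \in C^{2n}[a,b]$, then

\[ f(x) - H_n(x) = \frac{\Psi_n^2(x)}{(2n)!} f^{(2n)}(\xi) \quad \text{for some } \xi = \xi(x) \in [a,b]. \tag{4.19} \]

$H_n(x)$ is called the Hermite interpolating polynomial.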


REMARK 4.18 The $\ell_k$ in Theorem 4.8 are the same as the Lagrange
basis functions we saw on page 212.


6 One observes that the interpolating polynomial to Runge's function oscillates wildly as the
degree increases. Specifying the derivative at points conceivably will limit rapid increases
and decreases.

7 Here, it is more convenient to start the indexing with 1, rather than 0.
PROOF Note that

\[ \hat{h}_k(x_j) = h_k'(x_j) = 0 \quad \text{for } 1 \le k, j \le n, \quad \text{and} \]

\[ h_k(x_j) = \hat{h}_k'(x_j) = \begin{cases} 1 & \text{if } k = j, \\ 0 & \text{if } k \ne j. \end{cases} \]

Thus, $H_n(x_j) = f(x_j)$ and $H_n'(x_j) = f'(x_j)$ for $j = 1, 2, \ldots, n$, so $H_n(x)$
satisfies the interpolation requirements. To show uniqueness, suppose $G(x)$ of
degree less than or equal to $2n-1$ also satisfies the interpolation conditions.
Let

\[ R(x) = H_n(x) - G(x). \]

Then $R(x_i) = R'(x_i) = 0$, so $R$ has double roots at $x_1, \ldots, x_n$. Hence,

\[ R(x) = q(x)(x - x_1)^2 (x - x_2)^2 \cdots (x - x_n)^2 \]

for some polynomial $q(x)$. If $q(x) \not\equiv 0$, then $R(x)$ has degree greater than or
equal to $2n$, which is a contradiction. Thus, $q(x) \equiv 0$, so $R(x) \equiv 0$, so the
Hermite interpolating polynomial must be unique.

Now consider (4.19). The formula is trivially satisfied for $x = x_i$ for any $i$.
Suppose that $x \ne x_i$ for any $i$. Let

\[ g(t) = f(t) - H_n(t) - \frac{(t - x_1)^2 \cdots (t - x_n)^2}{(x - x_1)^2 \cdots (x - x_n)^2} \bigl(f(x) - H_n(x)\bigr). \]

Then $g(x_i) = 0$ for each $i$ and $g(x) = 0$, so Rolle's Theorem implies that $g'(t)$
has $n$ distinct zeros $\xi_i$, $1 \le i \le n$, on $[a,b]$, with $\xi_i \ne x_j$ for any $j$. Examining

\[ g'(t) = f'(t) - H_n'(t) - \frac{d}{dt}\left[ \frac{(t - x_1)^2 \cdots (t - x_n)^2}{(x - x_1)^2 \cdots (x - x_n)^2} \right] \bigl(f(x) - H_n(x)\bigr), \]

we see that $g'(x_i) = 0$ for $i = 1, 2, \ldots, n$. Therefore, $g'(t)$ has $2n$ distinct zeros
on $[a,b]$. By the Generalized Rolle's Theorem, $g^{(2n)}(t)$ has at least one zero
$\xi(x) \in [a,b]$. Thus,

\[ g^{(2n)}(\xi(x)) = 0 = f^{(2n)}(\xi) - 0 - \frac{(2n)!}{(\Psi_n(x))^2} \bigl(f(x) - H_n(x)\bigr), \]

i.e.,

\[ f(x) - H_n(x) = \frac{\Psi_n^2(x)}{(2n)!} f^{(2n)}(\xi). \]

□
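The formulas of Theorem 4.8 translate directly into code. The following is a minimal sketch under stated assumptions: it uses the standard Hermite basis $h_k(x) = [1 - 2\ell_k'(x_k)(x - x_k)]\ell_k(x)^2$ and $\hat{h}_k(x) = (x - x_k)\ell_k(x)^2$, with $\ell_k'(x_k) = \sum_{i \ne k} 1/(x_k - x_i)$. The test function $\sin$, the three nodes, and the name `hermite_interp` are illustrative choices, not taken from the text.

```python
import math

def hermite_interp(x, xs, fs, dfs):
    """Evaluate H_n(x) = sum_k f(x_k) h_k(x) + f'(x_k) hhat_k(x)."""
    total = 0.0
    for k, xk in enumerate(xs):
        ell = 1.0        # Lagrange basis function l_k evaluated at x
        dell_k = 0.0     # l_k'(x_k) = sum_{i != k} 1 / (x_k - x_i)
        for i, xi in enumerate(xs):
            if i != k:
                ell *= (x - xi) / (xk - xi)
                dell_k += 1.0 / (xk - xi)
        h = (1.0 - 2.0 * dell_k * (x - xk)) * ell ** 2
        h_hat = (x - xk) * ell ** 2
        total += fs[k] * h + dfs[k] * h_hat
    return total

xs = [0.0, 0.5, 1.0]
fs = [math.sin(t) for t in xs]
dfs = [math.cos(t) for t in xs]

x = 0.25
approx = hermite_interp(x, xs, fs, dfs)
psi_sq = math.prod((x - t) ** 2 for t in xs)
# error bound from (4.19); here |f^(6)| = |sin| <= 1
bound = psi_sq / math.factorial(2 * len(xs))
print(approx, math.sin(x), abs(approx - math.sin(x)), bound)
```

With $n = 3$ nodes the interpolant has degree at most 5, and the observed error at $x = 0.25$ stays below the bound $\Psi_3^2(x)/6!$.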


In the next section, we consider approximation by polynomials in such a way 
that the graph does not necessarily go through the data exactly. This type of 
approximation is appropriate, for example, when there is much data, and the 
data contains small errors, or when we need to approximate an underlying 
function with a low-degree polynomial. 
4.3.6 Least Squares Polynomial Approximation


We introduced the least squares problem in Section 3.3.8 on page 139, where 
we saw that a QR decomposition was a stable way of computing solutions. 
We also saw, in Section 3.5 (page 174), that least squares solutions to discrete 
problems could be analyzed with the singular value decomposition. Here, we 
will examine the least squares problem in the context of spaces of functions 
and orthogonal polynomials. 

Recall the general least squares problem: 


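DEFINITION 4.8 Let $V$ be an inner product space and let $W \subset V$ be a
finite-dimensional subspace of $V$. Suppose

\[ W = \mathrm{span}(w_1, w_2, \ldots, w_n), \]

and suppose that $v \in V$. The general least squares problem is to find $w \in W$
such that

\[ \|w - v\|^2 \le \|u - v\|^2 \quad \text{for all } u \in W, \]

where $\|f\| = (f, f)^{1/2}$. That is, we minimize

\[ F(\alpha_1, \alpha_2, \ldots, \alpha_n) = \Bigl( \sum_{j=1}^{n} \alpha_j w_j - v, \; \sum_{k=1}^{n} \alpha_k w_k - v \Bigr), \]

where

\[ w = \sum_{j=1}^{n} \alpha_j w_j. \]

We saw that we could find $\alpha_1, \alpha_2, \ldots, \alpha_n$ by solving the linear system

\[ \sum_{j=1}^{n} \alpha_j (w_j, w_\ell) = (v, w_\ell), \quad \ell = 1, 2, \ldots, n. \tag{4.20} \]

That is, $A\alpha = b$, where

\[ \begin{pmatrix} (w_1, w_1) & (w_1, w_2) & \cdots & (w_1, w_n) \\ \vdots & \vdots & & \vdots \\ (w_n, w_1) & (w_n, w_2) & \cdots & (w_n, w_n) \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{pmatrix} = \begin{pmatrix} (v, w_1) \\ (v, w_2) \\ \vdots \\ (v, w_n) \end{pmatrix}. \tag{4.21} \]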
We have previously seen (4.21) both in Section 3.3.8, where we saw how to
solve (4.21) with a QR factorization in the case of finite-dimensional Hilbert
spaces, and in Section 4.2.2 (on page 198). There, we mentioned that $A$
is positive definite and thus nonsingular. (You will show this in Problem 8
on page 285 below.) Thus, $\alpha$ is uniquely determined. We also saw that
$\{w_1, \ldots, w_n\}$ could be made orthogonal by the Gram–Schmidt procedure.
Suppose that $\{w_1, \ldots, w_n\}$ are orthogonal. Then (4.20) becomes

\[ \alpha_j = \frac{(v, w_j)}{(w_j, w_j)}, \quad j = 1, 2, \ldots, n, \tag{4.22} \]

and

\[ w = \sum_{j=1}^{n} \frac{(v, w_j)}{(w_j, w_j)}\, w_j \]

is the solution to the least squares problem.
In this section, we consider $V = C[a,b]$ and $W = P^n$, the space of polynomials
of degree less than or equal to $n$, with the inner product

\[ (f, g) = \int_a^b f(x) g(x) \rho(x)\,dx, \quad \text{with } \rho(x) > 0 \text{ for } x \in (a,b). \]

Thus, $W = \mathrm{span}(1, x, x^2, \ldots, x^n)$.


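REMARK 4.19 Generally, to find the least squares approximation, an
orthogonal set of polynomials is first found with respect to weight function
$\rho(x)$. The reason is that $A$ in (4.20) can be very ill-conditioned. Consider, for
example, $a = 0$, $b = 1$, and $\rho(x) = 1$. Then the matrix $A$ has the form

\[ A = \begin{pmatrix} (1,1) & (1,x) & \cdots & (1,x^n) \\ (x,1) & (x,x) & \cdots & (x,x^n) \\ \vdots & \vdots & & \vdots \\ (x^n,1) & (x^n,x) & \cdots & (x^n,x^n) \end{pmatrix} = \begin{pmatrix} 1 & \frac{1}{2} & \cdots & \frac{1}{n+1} \\ \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n+2} \\ \vdots & \vdots & & \vdots \\ \frac{1}{n+1} & \frac{1}{n+2} & \cdots & \frac{1}{2n+1} \end{pmatrix}, \]

which is a Hilbert matrix. (We saw that Hilbert matrices are very ill-conditioned;
see Table 3.1 on page 125.)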


REMARK 4.20 We give intervals, weight functions, and recurrence rela- 
tions for several classical sets of orthogonal polynomials. Given a, b, and p(x), 
the set of orthogonal polynomials can be obtained using the Gram—Schmidt 
process. However, we will see later that any set of orthogonal polynomials 
also obeys a three-term recurrence relation. 


(a) Legendre polynomials $P_n(x)$
$a = -1$, $b = 1$, $\rho(x) = 1$, $P_0(x) = 1$, $P_1(x) = x$
Recurrence relation:

\[ (n+1)P_{n+1}(x) = (2n+1)xP_n(x) - nP_{n-1}(x), \quad n \ge 1. \]

(b) Chebyshev (Tschebycheff) polynomials $T_n(x)$
$a = -1$, $b = 1$, $\rho(x) = 1/\sqrt{1-x^2}$, $T_0(x) = 1$, $T_1(x) = x$
Recurrence relation:

\[ T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x), \quad n \ge 1. \]

Also, $T_n(x) = \cos(n\theta)$, $x = \cos\theta$.

(c) Hermite polynomials $H_n(x)$
$a = -\infty$, $b = \infty$, $\rho(x) = e^{-x^2}$, $H_0(x) = 1$, $H_1(x) = 2x$
Recurrence relation:

\[ H_{n+1}(x) = 2xH_n(x) - 2nH_{n-1}(x), \quad n \ge 1. \]

(d) Laguerre polynomials $L_n(x)$
$a = 0$, $b = \infty$, $\rho(x) = e^{-x}$, $L_0(x) = 1$, $L_1(x) = x - 1$
Recurrence relation:

\[ L_{n+1}(x) = (-2n - 1 + x)L_n(x) - n^2 L_{n-1}(x), \quad n \ge 1. \]


Example 4.8
Find the least squares cubic approximation to $f(x) = \sin(\pi x/2)$ on $[-1,1]$ with
respect to the inner product

\[ (g, h) = \int_{-1}^{1} g(x) h(x)\,dx. \]

Solution: The Legendre polynomials are orthogonal on $[-1,1]$ with this weight
function. We have

\[ P_0(x) = 1, \quad P_1(x) = x, \quad P_2(x) = \tfrac{1}{2}(3x^2 - 1), \quad P_3(x) = \tfrac{1}{2}(5x^3 - 3x), \]

\[ (f, P_0) = 0, \quad (f, P_1) = 0.81057, \quad (f, P_2) = 0, \quad (f, P_3) = -0.06425, \]

\[ (P_0, P_0) = 2, \quad (P_1, P_1) = 0.66667, \quad (P_2, P_2) = 0.40000, \quad (P_3, P_3) = 0.28571. \]

Then,

\[ w(x) = \sum_{j=0}^{3} \alpha_j P_j(x), \quad \alpha_j = \frac{(P_j, f)}{(P_j, P_j)}, \]

so

\[ w(x) = 1.5532x - 0.56223x^3 \approx \sin\frac{\pi x}{2}. \]

We obtain the results in Table 4.1. □

TABLE 4.1: Legendre polynomial approximation of degree 3 to $\sin(\pi x/2)$

x | sin(πx/2) | w(x)

We have the following interesting and useful results about orthogonal pol- 
ynomials. 


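PROPOSITION 4.2
If $\{\varphi_n(x)\}_{n=0}^{\infty}$ is an orthogonal sequence of polynomials and $\varphi_n(x)$ is of degree
$n$ for each $n$, then

\[ \mathrm{span}(1, x, x^2, \ldots, x^m) = \mathrm{span}(\varphi_0(x), \varphi_1(x), \ldots, \varphi_m(x)) = P^m \]

for each $m$.

PROOF (by induction): First, $P^0 = \mathrm{span}(1) = \mathrm{span}(\varphi_0(x))$. Now suppose that

\[ P^{m-1} = \mathrm{span}(1, x, \ldots, x^{m-1}) = \mathrm{span}(\varphi_0(x), \ldots, \varphi_{m-1}(x)), \]

and consider $\varphi_m(x)$:

\[ \varphi_m(x) = c_m x^m + \sum_{i=0}^{m-1} c_i x^i = c_m x^m + \sum_{i=0}^{m-1} \hat{c}_i \varphi_i(x). \]

Thus, since $c_m \ne 0$,

\[ x^m = \frac{1}{c_m}\Bigl( \varphi_m(x) - \sum_{i=0}^{m-1} \hat{c}_i \varphi_i(x) \Bigr). \]

Hence,

\[ \mathrm{span}(1, x, \ldots, x^m) \subseteq \mathrm{span}(\varphi_0(x), \varphi_1(x), \ldots, \varphi_m(x)) \subseteq \mathrm{span}(1, x, \ldots, x^m). \]

Thus, $P^m = \mathrm{span}(1, x, \ldots, x^m) = \mathrm{span}(\varphi_0(x), \varphi_1(x), \ldots, \varphi_m(x))$ for each $m$.

REMARK 4.21 If $p \in P^n$, then $p(x) = \sum_{j=0}^{n} c_j \varphi_j(x) = \sum_{j=0}^{n} \frac{(p, \varphi_j)}{(\varphi_j, \varphi_j)} \varphi_j(x)$.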
PROPOSITION 4.3
Let $\{\varphi_n(x)\}_{n=0}^{\infty}$ be an orthogonal family of polynomials on $C[a,b]$ with respect
to the inner product $(f, g) = \int_a^b f(x) g(x) \rho(x)\,dx$, with $\rho(x) > 0$ for $a < x < b$.
Let $\varphi_n(x)$ be of degree $n$. Then $\varphi_n(x)$ has $n$ distinct zeros in the open interval
$(a,b)$.

PROOF (by contradiction) Let $x_1, x_2, \ldots, x_m$ be the zeros of $\varphi_n(x)$ for which

(a) $a < x_i < b$ and

(b) $\varphi_n(x)$ changes sign at $x_i$.

Since the degree of $\varphi_n(x)$ is $n$, we know that $m \le n$. Assume $m < n$. (We
will show that this is impossible.) By the definition of $x_1, x_2, \ldots, x_m$, let

\[ B(x) = (x - x_1)(x - x_2) \cdots (x - x_m). \]

Then $\varphi_n(x) B(x) = (x - x_1) \cdots (x - x_m)\varphi_n(x)$ does not change sign on $(a,b)$.
To see this, the assumptions on $\varphi_n(x)$ imply that

\[ \varphi_n(x) = h(x)(x - x_1)^{r_1} (x - x_2)^{r_2} \cdots (x - x_m)^{r_m}, \]

such that each $r_i$ is odd and such that $h(x)$ does not change sign on $(a,b)$. Thus,

\[ B(x)\varphi_n(x) = h(x)(x - x_1)^{r_1+1} (x - x_2)^{r_2+1} \cdots (x - x_m)^{r_m+1} \]

does not change sign on $(a,b)$. Consequently,

\[ \int_a^b B(x) \varphi_n(x) \rho(x)\,dx \ne 0. \]

But since the degree of $B$ is $m < n$,

\[ B(x) = \sum_{j=0}^{m} c_j \varphi_j(x), \]

where $c_j = (B, \varphi_j)/(\varphi_j, \varphi_j)$ by Remark 4.21. Thus,

\[ \int_a^b B(x) \varphi_n(x) \rho(x)\,dx = \sum_{j=0}^{m} c_j \int_a^b \varphi_j(x) \varphi_n(x) \rho(x)\,dx = 0, \]

a contradiction. Thus, $m = n$ and the theorem follows, since $\varphi_n(x)$ can have at most $n$ roots
and the assumptions on $x_1, x_2, \ldots, x_n$ imply that they must be distinct.


Example 4.9
We give roots of the Legendre and Chebyshev polynomials.

(a) Legendre polynomials (The interval is $(-1,1)$.)

$P_0(x) = 1$;

$P_1(x) = x$, $x_1 = 0$;

$P_2(x) = \frac{1}{2}(3x^2 - 1)$, $x_1 = -\frac{1}{\sqrt{3}}$, $x_2 = \frac{1}{\sqrt{3}}$;

$P_3(x) = \frac{1}{2}(5x^3 - 3x)$, $x_1 = 0$, $x_2 = -\sqrt{3/5}$, $x_3 = +\sqrt{3/5}$;

(b) Chebyshev polynomials (The interval is $(-1,1)$.)

$T_n(x) = 0$ at $x_i = \cos\left( \frac{(2i+1)\pi}{2n} \right)$ for $i = 0, 1, \ldots, n-1$.

□


The following result tells us that any set of orthogonal polynomials is char- 
acterized by a three-term recurrence relation. 


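PROPOSITION 4.4

Let $\varphi_0(x) = 1$ and let

\[ \varphi_1(x) = x - B_1, \quad B_1 = \frac{(x\varphi_0, \varphi_0)}{(\varphi_0, \varphi_0)}, \quad \text{where } (f, g) = \int_a^b \rho(x) f(x) g(x)\,dx. \]

Suppose that $\varphi_k(x)$ is of degree $k$ with leading term $x^k$ for $k \ge 0$. Then
$\{\varphi_k(x)\}_{k=0}^{\infty}$ satisfies the three-term recurrence

\[ \varphi_k(x) = (x - B_k)\varphi_{k-1}(x) - C_k \varphi_{k-2}(x), \tag{4.23} \]

where

\[ B_k = \frac{(x\varphi_{k-1}, \varphi_{k-1})}{(\varphi_{k-1}, \varphi_{k-1})} \quad \text{and} \quad C_k = \frac{(x\varphi_{k-1}, \varphi_{k-2})}{(\varphi_{k-2}, \varphi_{k-2})} \tag{4.24} \]

for $k \ge 2$ if and only if $\{\varphi_k\}_{k=0}^{\infty}$ are orthogonal on the interval $[a,b]$ with
respect to the weight function $\rho(x)$.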

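PROOF We first suppose that the set $\{\varphi_k(x)\}_{k=0}^{\infty}$ is orthogonal. We
need to show that $\varphi_k(x)$ satisfies the three-term recurrence (4.23). First, we
observe that $\varphi_{k+1}(x) - x\varphi_k(x)$ is of degree $k$. Then,

\[ \varphi_{k+1}(x) - x\varphi_k(x) = -\hat{B}_{k+1}\varphi_k(x) - \hat{C}_{k+1}\varphi_{k-1}(x) - \cdots - \hat{\delta}_{k+1}\varphi_0(x) \]

because

\[ \varphi_{k+1}(x) - x\varphi_k(x) \in P^k = \mathrm{span}(\varphi_0, \varphi_1, \ldots, \varphi_k) \]

by Proposition 4.2. But

\[ (\varphi_{k+1} - x\varphi_k, \varphi_j) = (\varphi_{k+1}, \varphi_j) - (\varphi_k, x\varphi_j) = 0 \]

for $j = 0, 1, 2, \ldots, k-2$. Thus, $\varphi_{k+1} - x\varphi_k = -\hat{B}_{k+1}\varphi_k - \hat{C}_{k+1}\varphi_{k-1}$, that
is,

\[ \varphi_{k+1} = (x - \hat{B}_{k+1})\varphi_k - \hat{C}_{k+1}\varphi_{k-1}. \]

Since $(\varphi_{k+1}, \varphi_k) = 0 = (x\varphi_k, \varphi_k) - \hat{B}_{k+1}(\varphi_k, \varphi_k)$, $\hat{B}_{k+1} = B_{k+1}$. Since
$(\varphi_{k+1}, \varphi_{k-1}) = 0$, $\hat{C}_{k+1} = (x\varphi_k, \varphi_{k-1})/(\varphi_{k-1}, \varphi_{k-1}) = C_{k+1}$.

For the converse, we suppose that $\varphi_k(x)$ satisfies the three-term recurrence
(4.23). We need to show that $\{\varphi_k(x)\}_{k=0}^{\infty}$ are orthogonal. Since each $\varphi_k(x)$ is
of the form $x^k$ + {lower-order terms}, all denominators for $B_k$ and $C_k$ in (4.24)
are nonzero. We will show by induction on $k$ that $\int_a^b \rho(x)\varphi_k(x)\varphi_i(x)\,dx = 0$
for $i < k$. For $k = 1$,

\[ (\varphi_1, \varphi_0) = (x\varphi_0, \varphi_0) - B_1(\varphi_0, \varphi_0) = 0. \]

Assume now that the result is true for $k = n-1$. Then,

\[ (\varphi_n, \varphi_{n-1}) = ((x - B_n)\varphi_{n-1}, \varphi_{n-1}) - C_n(\varphi_{n-2}, \varphi_{n-1}) \]
\[ = (x\varphi_{n-1}, \varphi_{n-1}) - B_n(\varphi_{n-1}, \varphi_{n-1}) - C_n(\varphi_{n-2}, \varphi_{n-1}) = 0 \quad \text{(by the definition of } B_n\text{)}, \]

\[ (\varphi_n, \varphi_{n-2}) = ((x - B_n)\varphi_{n-1}, \varphi_{n-2}) - C_n(\varphi_{n-2}, \varphi_{n-2}) \]
\[ = (x\varphi_{n-1}, \varphi_{n-2}) - C_n(\varphi_{n-2}, \varphi_{n-2}) = 0 \quad \text{(by the definition of } C_n\text{)}. \]

For $i < n-2$,

\[ (\varphi_n, \varphi_i) = ((x - B_n)\varphi_{n-1}, \varphi_i) - C_n(\varphi_{n-2}, \varphi_i) = (x\varphi_{n-1}, \varphi_i) \]
\[ = (\varphi_{n-1}, x\varphi_i) = (\varphi_{n-1}, \varphi_{i+1} + B_{i+1}\varphi_i + C_{i+1}\varphi_{i-1}) = 0. \]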

4.3.7 Error in Least Squares Polynomial Approximation 


We now consider the error in the least squares approximation to a function 
f in C[-1,1]. Our first step along these lines is the following observation. 


REMARK 4.22 Necessarily $\|f - p_n\|_2 \to 0$ as $n \to \infty$ in the $L^2$-norm,
where $p_n$ is the least squares approximation of degree $n$ to $f \in C[-1,1]$. This
follows from the Weierstrass Approximation Theorem since, given $\epsilon > 0$, there
exists a polynomial $p$ such that $\|f - p\|_\infty < \epsilon/\sqrt{2}$. Hence, provided that $n$ is
sufficiently large,

\[ \|f - p_n\|_2 \le \|f - p\|_2 = \left( \int_{-1}^{1} (f(x) - p(x))^2\,dx \right)^{1/2} \le \|f - p\|_\infty \sqrt{2} < \epsilon. \]
Now let's consider $\|f - p_n\|_\infty$. We have:

THEOREM 4.9

Suppose that $f \in C[-1,1]$ and $p_n(x)$ is the least-squares approximation to
$f(x)$ with respect to weight function $\rho(x) = (1 - x^2)^{-1/2}$, i.e., $p_n(x)$ is a linear
combination of orthogonal Chebyshev polynomials. Then

\[ \|f - p_n\|_\infty \le \left( 4 + \frac{4}{\pi^2} \ln n \right) E_n(f), \quad \text{where } E_n(f) = \|f - p_n^*\|_\infty \]

and $p_n^*$ is the best uniform approximation to $f$, i.e.,

\[ \|f - p_n^*\|_\infty \le \|f - u_n\|_\infty \quad \text{for any } u_n \in P^n. \]

PROOF See [75].


REMARK 4.23 If $f$ satisfies a Lipschitz condition

\[ |f(x) - f(z)| \le k|x - z| \]

for $x, z \in [-1,1]$, then it can be shown that

\[ E_n(f) \le 6k/n, \]

so $\|f - p_n\|_\infty \to 0$ as $n \to \infty$. Furthermore, if $f \in C^r[-1,1]$, then

\[ E_n(f) \le \frac{c}{n^r} \]

for some constant $c$ independent of $n$. Thus, if $f$ is smooth, the least squares
approximation rapidly improves as $n$ increases. In particular, if $f \in C^{\infty}[-1,1]$,
the error in the Chebyshev least squares approximation goes to zero more
rapidly than any finite power of $1/n$ as $n \to \infty$.


4.3.8 Minimax (Best Uniform) Approximation 


We now briefly study the best uniform approximation. (See [75] for a thor- 
ough treatment of best uniform approximations.) 


DEFINITION 4.9 Let $V = C[a,b]$ be a normed vector space with norm

\[ \|v\|_\infty = \max_{a \le x \le b} |v(x)|. \]

Let $W = P^n \subset V$ be the set of polynomials of degree $n$ or less. The best
uniform approximation $p_n^*(x)$ to $f(x) \in C[a,b]$ (existence is guaranteed by
Theorem 4.1) satisfies $\|p_n^* - f\|_\infty \le \|p_n - f\|_\infty$ for all $p_n \in P^n$, and is called
the minimax or best uniform approximation to $f$. It is called "minimax"
because

\[ \|p_n^* - f\|_\infty = \min_{p_n \in P^n} \left\{ \max_{a \le x \le b} |p_n(x) - f(x)| \right\}. \]

DEFINITION 4.10 Let $f(x)$ be defined on $[a,b]$; the modulus of continuity
of $f(x)$ on $[a,b]$, $\omega(\delta)$, is defined for $\delta > 0$ by

\[ \omega(\delta) = \sup_{\substack{x_1, x_2 \in [a,b] \\ |x_1 - x_2| \le \delta}} |f(x_1) - f(x_2)|. \]

REMARK 4.24 $\omega$ depends on $f$, $[a,b]$, and $\delta$, so $\omega(\delta)$ is actually shorthand
for $\omega(f; [a,b]; \delta)$. Notice that if $f$ is Lipschitz continuous on $[a,b]$ with Lipschitz constant $L$, then
$\omega(\delta) \le L\delta$.

Several properties of modulus of continuity follow from Definition 4.10. 


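LEMMA 4.1
If $0 < \delta_1 \le \delta_2$, then $\omega(\delta_1) \le \omega(\delta_2)$.

LEMMA 4.2
$f \in C[a,b]$ if and only if $\lim_{\delta \to 0} \omega(\delta) = 0$. ($f$ is uniformly continuous on $[a,b]$
if $f \in C[a,b]$.)

LEMMA 4.3
If $\lambda > 0$, then $\omega(\lambda\delta) \le (1 + \lambda)\omega(\delta)$.

PROOF Let $n$ be an integer such that $n \le \lambda < n+1$, so that $\omega(\lambda\delta) \le
\omega((n+1)\delta)$. Suppose that $|x_1 - x_2| \le (n+1)\delta$ and say $x_2 > x_1$. With the
points $z_j = x_1 + j(x_2 - x_1)/(n+1)$, $j = 0, 1, \ldots, n+1$, divide $[x_1, x_2]$ into
$n+1$ equal parts, each of length $(x_2 - x_1)/(n+1) \le \delta$. Then

\[ |f(x_2) - f(x_1)| = \Bigl| \sum_{j=0}^{n} \bigl( f(z_{j+1}) - f(z_j) \bigr) \Bigr| \le \sum_{j=0}^{n} |f(z_{j+1}) - f(z_j)| \le (n+1)\omega(\delta). \]

Thus, $\omega((n+1)\delta) \le (n+1)\omega(\delta)$. This and $(n+1) \le \lambda + 1$ thus imply

\[ \omega(\lambda\delta) \le \omega((n+1)\delta) \le (\lambda + 1)\omega(\delta). \]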
4.3.8.1 Error in the Best Uniform Approximation


We now consider the error in the best uniform approximation. 


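DEFINITION 4.11 We define

\[ E_n(f; [a,b]) = E_n(f) = \|f - p_n^*\|_\infty \]

to be the error in the best uniform approximation.

THEOREM 4.10

\[ E_n(f; [0,1]) = \|f - p_n^*\|_\infty \le \frac{3}{2}\, \omega\!\left( \frac{1}{\sqrt{n}} \right) \quad \text{for } f \in C[0,1]. \]

PROOF Recall that

\[ B_n(f; x) = \sum_{k=0}^{n} f\!\left( \frac{k}{n} \right) \binom{n}{k} x^k (1-x)^{n-k} \]

is the $n$-th Bernstein polynomial of $f$. Thus, $\|f - p_n^*\|_\infty \le \|f - B_n(f; x)\|_\infty$.
But, since $\sum_{k=0}^{n} \binom{n}{k} x^k (1-x)^{n-k} = 1$, Lemma 4.3 with $\delta = n^{-1/2}$ and $\lambda = \sqrt{n}\,|x - k/n|$ gives

\[ |f(x) - B_n(f; x)| \le \sum_{k=0}^{n} \left| f(x) - f\!\left( \frac{k}{n} \right) \right| \binom{n}{k} x^k (1-x)^{n-k} \le \omega(n^{-1/2}) \left[ 1 + \sqrt{n} \sum_{k=0}^{n} \left| x - \frac{k}{n} \right| \binom{n}{k} x^k (1-x)^{n-k} \right]. \]

By the Cauchy–Schwarz inequality,

\[ \sum_{k=0}^{n} \left| x - \frac{k}{n} \right| \binom{n}{k} x^k (1-x)^{n-k} \le \left( \sum_{k=0}^{n} \left( x - \frac{k}{n} \right)^2 \binom{n}{k} x^k (1-x)^{n-k} \right)^{1/2} = \left( \frac{x(1-x)}{n} \right)^{1/2}. \]

(See the proof of the Weierstrass approximation theorem, starting on page 205.
There, we saw that $B_n(1; x) = 1$, $B_n(x; x) = x$, and $B_n(x^2; x) = x^2 +
x(1-x)/n$.) Hence, since $x(1-x) \le 1/4$ on $[0,1]$,

\[ |f(x) - B_n(f; x)| \le \omega(n^{-1/2}) \left[ 1 + \frac{1}{2} \right] = \frac{3}{2}\, \omega(n^{-1/2}). \]

Therefore,

\[ \|f - p_n^*\|_\infty \le \frac{3}{2}\, \omega\!\left( \frac{1}{\sqrt{n}} \right). \]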
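The bound used in the proof, $\|f - B_n(f;\cdot)\|_\infty \le \frac{3}{2}\omega(n^{-1/2})$, can be observed numerically. The following is a small sketch under stated assumptions: the test function $f(x) = |x - \frac{1}{2}|$ is an illustrative choice whose modulus of continuity on $[0,1]$ is $\omega(\delta) = \min(\delta, \frac{1}{2})$, and the maximum error is only sampled on a 201-point grid.

```python
import math

def bernstein(f, n, x):
    """B_n(f; x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k)."""
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

def f(x):
    return abs(x - 0.5)

def omega(delta):
    """Modulus of continuity of f on [0, 1]: slopes are +-1 and the range is [0, 1/2]."""
    return min(delta, 0.5)

results = []
for n in (4, 16, 64):
    grid = [j / 200 for j in range(201)]
    err = max(abs(f(x) - bernstein(f, n, x)) for x in grid)   # sampled sup-norm error
    bound = 1.5 * omega(n ** -0.5)                              # (3/2) * omega(1/sqrt(n))
    results.append((n, err, bound))
    print(n, err, bound)
```

Since $E_n(f) \le \|f - B_n(f;\cdot)\|_\infty$, each printed bound is also an upper bound for the best uniform approximation error of this $f$.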


REMARK 4.25 Better error estimates than given in Theorem 4.10 can
be obtained. In particular, Jackson's Theorem [75] gives:

(a) if $f \in C[-1,1]$, then $E_n(f; [-1,1]) \le 6\,\omega\!\left( \frac{1}{n} \right)$;

(b) if $f \in C[a,b]$, then $E_n(f; [a,b]) \le 6\,\omega\!\left( \frac{b-a}{2n} \right)$ (by (a) and a change of
variables).

REMARK 4.26 Suppose that $f \in C^1[-1,1]$ and $|f'(x)| \le M$ on $[-1,1]$.
Then $E_n(f; [-1,1]) \le 6M/n$.

PROOF (of Remark 4.26)

\[ \omega\!\left( \frac{1}{n} \right) = \sup_{\substack{x_1, x_2 \in [-1,1] \\ |x_1 - x_2| \le 1/n}} |f(x_1) - f(x_2)| = \sup_{\substack{x_1, x_2 \in [-1,1] \\ |x_1 - x_2| \le 1/n}} |f'(\xi(x_1, x_2))|\,|x_1 - x_2| \le \frac{M}{n}, \]

where $\xi(x_1, x_2) \in [x_1, x_2]$ is the point guaranteed by the Mean Value
Theorem to have the property $f(x_1) - f(x_2) = f'(\xi(x_1, x_2))(x_1 - x_2)$. Remark 4.26
then follows from Remark 4.25.

REMARK 4.27 If $f \in C^k[-1,1]$, $n > k$, and $|f^{(k)}(x)| \le M_k$ on $[-1,1]$,
then it can be shown that

\[ E_n(f; [-1,1]) \le \frac{c_k M_k}{n^k}, \quad \text{where } c_k \text{ is independent of } n. \]
4.3.8.2 Characterization of the Best Uniform Approximation 


Let $e(x) = f(x) - p_n^*(x)$, where $p_n^*(x)$ is the best uniform approximation to
$f(x)$ in $P^n$. Thus, $\|e\|_\infty = E_n(f; [a,b])$.


THEOREM 4.11
There exist (at least) two distinct points $x_1, x_2 \in [a,b]$ such that

\[ |e(x_1)| = |e(x_2)| = E_n(f; [a,b]) \quad \text{and} \quad e(x_1) = -e(x_2). \]

PROOF The continuous curve $y = e(x)$ is constrained to lie between
$y = \pm E_n(f)$ for $a \le x \le b$ and touches at least one of these lines. We wish
to prove that it must touch both. Suppose it doesn't touch both. Then, as is
illustrated in Figure 4.5, a better approximation to $f$ than $p_n^*$ exists. Assume
that $e(x) > -E_n(f)$ for all $x \in [a,b]$. Then

\[ \min_{a \le x \le b} e(x) = m > -E_n(f) \]

and $c = (E_n(f) + m)/2 > 0$. Since $q_n(x) = p_n^*(x) + c \in P^n$, it follows that
$f(x) - q_n(x) = e(x) - c$ and

\[ -(E_n(f) - c) = m - c \le e(x) - c \le E_n(f) - c. \]

We thus have $\|f - q_n\|_\infty = E_n(f) - c$, contradicting the definition of $E_n(f)$.
Hence, there must be a point $x_1$ such that $e(x_1) = -E_n(f)$. Similarly, there
is a point $x_2$ such that $e(x_2) = E_n(f)$.

FIGURE 4.5: A better approximation to $f$ than $p_n^*$.


COROLLARY 4.1
The best uniform approximating constant to $f(x)$ is

\[ p_0^* = \frac{1}{2}\left[ \max_{a \le x \le b} f(x) + \min_{a \le x \le b} f(x) \right], \]

and

\[ E_0(f) = \frac{1}{2}\left[ \max_{a \le x \le b} f(x) - \min_{a \le x \le b} f(x) \right]. \]

PROOF If $d$ is any other constant than $p_0^*$, then $e(x) = f(x) - d$ does not
satisfy Theorem 4.11.

The argument in Theorem 4.11 can be extended as follows.


THEOREM 4.12

Suppose that $f \in C[a,b]$. Then $p_n^*$ is the best uniform approximation on $[a,b]$ to $f$
from $P^n$ if and only if there exists an alternating set consisting of $n+2$ points
for the error $e(x) = f(x) - p_n^*(x)$, i.e., there exist

\[ a \le x_0 < x_1 < x_2 < \cdots < x_{n+1} \le b \]

such that

\[ |e(x_j)| = \|f - p_n^*\|_\infty \]

for $j = 0, 1, 2, \ldots, n+1$ and

\[ e(x_j) = -e(x_{j+1}) \]

for $j = 0, 1, 2, \ldots, n$. We see this graphically in Figure 4.6. (It can be shown
that this theorem implies that $p_n^*$ is the unique best approximation to $f$ from
$P^n$; see [75].)

The alternating error given in Theorem 4.12 is often called the minimax
equi-oscillation property, and variants of Theorem 4.12 are called Chebyshev's
Equi-Oscillation Theorem. Note that Chebyshev polynomials have the
equi-oscillation property on $[-1,1]$.

FIGURE 4.6: $n+1$ alternations of the best approximation error
(Theorem 4.12).
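A case in which the alternating set of Theorem 4.12 can be written down explicitly is the best linear ($n = 1$) approximation to a convex function. The sketch below is an illustration under stated assumptions, not a construction from the text: for convex $f$ it uses the standard closed form in which the line is parallel to the secant through the endpoints and the error alternates at $a$, at the interior point $x^*$ with $f'(x^*)$ equal to the secant slope, and at $b$. The choice $f(x) = e^x$ on $[0,1]$ and all variable names are assumptions.

```python
import math

a, b = 0.0, 1.0
f = math.exp

# For convex f, the best linear approximation is parallel to the secant line.
m = (f(b) - f(a)) / (b - a)          # slope of the secant through the endpoints
x_star = math.log(m)                 # interior point with f'(x*) = m (here f' = exp)
c = 0.5 * (f(a) + f(x_star) - m * (a + x_star))   # chosen so that e(a) = -e(x*)

def p_star(x):
    return m * x + c

def e(x):
    """Error e(x) = f(x) - p*(x)."""
    return f(x) - p_star(x)

alternation = [e(a), e(x_star), e(b)]            # n + 2 = 3 points for n = 1
grid_max = max(abs(e(a + (b - a) * j / 10000)) for j in range(10001))
print(alternation, grid_max)
```

The three printed error values have equal magnitude and alternating sign, and the sampled maximum of $|e(x)|$ matches that magnitude, which is the equi-oscillation pattern the theorem requires.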


REMARK 4.28 In general, it is impossible to find in a finite number of
steps the best uniform approximation $p_n^* \in P^n$ to a given function $f \in C[a,b]$
in the uniform norm. A strategy used is to replace $[a,b]$ by a finite set of
distinct points in $[a,b]$ and solve the approximation problem on this finite point
set. The Exchange Method of Remez or Remez Algorithm is a computational
procedure for finding $p_n^*(x)$ on a finite point set. See, for example, [74].


REMARK 4.29 Since minimax approximations are difficult to find, near 
minimax approximations are useful. Consider the following two near minimax 
approximations. 


1. Let $f \in C[-1,1]$. We saw that if

\[ p_n(x) = \sum_{i=0}^{n} a_i T_i(x), \]

where $T_i(x)$ is the $i$-th Chebyshev polynomial and $p_n(x)$ is the least
squares approximation to $f(x)$ with respect to weight function $\rho(x) =
(1 - x^2)^{-1/2}$, then

\[ \|f - p_n\|_\infty \le \left[ 4 + \frac{4}{\pi^2} \ln n \right] E_n(f), \]

where $E_n(f) = \|f - p_n^*\|_\infty$ and $p_n^*$ is the minimax approximation. Thus,
$\|f - p_n\|_\infty$ decreases roughly in proportion to $E_n(f)$, and $p_n(x)$ is called a
near minimax approximation. (Note that $g(t) \in C[a,b]$ can be converted
by change of variables to $f(x) \in C[-1,1]$.)

2. Consider the Lagrange interpolating polynomial $p_n(x)$ with interpolating
nodes $x_i = \cos\left( \frac{2i+1}{n+1} \cdot \frac{\pi}{2} \right)$, $0 \le i \le n$, on $[-1,1]$. We saw that

\[ \|f - p_n\|_\infty \le \frac{1}{(n+1)!} \|f^{(n+1)}\|_\infty 2^{-n}. \]



This interpolating polynomial is sometimes considered a minimax ap- 
proximation. 


4.3.9 Interval (Rigorous) Bounds on the Errors 


Whether we are considering Taylor polynomial approximation, interpola- 
tion, least squares, or minimax approximation, the error term in the approx- 
imation can be put in the general form 


\[ f(x) = p(x) + K(x)M(f; x) \quad \text{for } x \in [a,b]. \tag{4.25} \]


We list $K$ and $M$ for various approximations^8 in Table 4.2. In such cases,
$p$ and $K$ can be evaluated explicitly, while $M(f; x)$ can be estimated using
interval arithmetic. We illustrated how to do this for $f(x) = e^x$, using a
degree-5 Taylor polynomial, in Example 1.17 on page 28. We elaborate here:
In addition to bounding particular values of the function, a maximum error
of approximation and rigorous bounds valid for all of $[a,b]$ can be inferred.
In particular, the polynomial part $p(x)$ is evaluated at a point (but using
outwardly rounded interval arithmetic to maintain mathematical rigor), and
the error part is evaluated with interval arithmetic.

TABLE 4.2: Error factors $K$ and $M$ in polynomial approximations
$f(x) = p(x) + K(x)M(f; x)$.

Type of approximation:
degree $n$ Taylor polynomial
polynomial interpolation at $n+1$ points
least squares approximation of degree $n$ in the Chebyshev norm, $f \in C[a,b]$


8 The error of approximation of smooth functions by Chebyshev polynomials can be much
less than for nonsmooth (merely $C^0$) functions, as is indicated in Remark 4.27 combined
with Theorem 4.9; however, bounds on the error may be more complicated to find in this
case.
Example 4.10
Consider approximating $\sin(x)$, $x \in [-0.1, 0.1]$, by a degree-5

1. Taylor polynomial about zero,
2. interpolating polynomial at the points $x_k = -.1 + .04k$, $0 \le k \le 5$.

For the Taylor polynomial, we observe that the fifth degree Taylor polynomial
is the same as the sixth degree Taylor polynomial, and, since the seventh
derivative of $\sin$ is $-\cos$, we have

\[ \sin(x) \in x - \frac{x^3}{6} + \frac{x^5}{120} - \frac{x^7}{5040}\cos(\xi) \quad \text{for some } \xi \in [-0.1, 0.1]. \tag{4.26} \]

We can replace $\cos(\xi)$ by an appropriate interval to get a pointwise estimate;
for example,

\[ \sin(0.05) \in .05 - \frac{.05^3}{6} + \frac{.05^5}{120} - \frac{.05^7}{5040}\cos([0, 0.05]) \subset [0.04997916927067, 0.04997916927068], \]

where the above bounds are mathematically rigorous. Here, $K$ was evaluated
at the point $x$, but $\cos(\xi)$ was replaced by $\cos([0, 0.05])$. Similarly,

\[ \sin(-0.01) \in (-.01) - \frac{(-.01)^3}{6} + \frac{(-.01)^5}{120} - \frac{(-.01)^7}{5040}\cos([-0.01, 0]) \subset [-0.00999983333417, -0.00999983333416]. \]

Thus, since we know $\sin(x)$ is monotonic for $x \in [-0.01, 0.05]$,
$[-0.00999983333417, 0.04997916927068]$ represents a fairly sharp bound on
the range $\{\sin(x) \mid x \in [-0.01, 0.05]\}$. Alternately, it may be more convenient
in some contexts to evaluate $K$ and $M$ over the entire interval, although this
leads to a less sharp result. Using that technique, we would have

\[ \sin(0.05) \in .05 - \frac{.05^3}{6} + \frac{.05^5}{120} - \frac{[-0.1, 0.1]^7}{5040}\cos([-0.1, 0.1]) \]
\[ \subset .05 - \frac{.05^3}{6} + \frac{.05^5}{120} + [-0.19841269841270 \times 10^{-10},\; 0.19841269841270 \times 10^{-10}] \]
\[ \subset [0.04997916925099, 0.04997916929068], \]

and

\[ \sin(-0.01) \in (-.01) - \frac{(-.01)^3}{6} + \frac{(-.01)^5}{120} + [-0.19841269841270 \times 10^{-10},\; 0.19841269841270 \times 10^{-10}] \]
\[ \subset [-0.00999983335401, -0.00999983331432], \]

thus obtaining (somewhat less sharp) bounds

\[ [-0.00999983335401, 0.04997916929068] \]

on the range $\{\sin(x) \mid x \in [-0.01, 0.05]\}$.
In general, substituting intervals into the polynomial approximation itself
does not give sharp bounds on the range. For example,

\[ \sin([-0.01, 0.05]) \in [-.01, .05] - \frac{[-.01, .05]^3}{6} + \frac{[-.01, .05]^5}{120} + [-0.19841269841270 \times 10^{-10},\; 0.19841269841270 \times 10^{-10}] \]
\[ \subset [-0.01002083335401, 0.05000016929068]. \]

Nonetheless, in some contexts in which there is no alternative, this technique 
gives usable bounds. 

Computing bounds based on the interpolating polynomial is similar to com- 
puting bounds based on the Taylor polynomial, and is left as Exercise 22. 
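The decomposition $f(x) = p(x) + K(x)M(f;x)$ can be mimicked with a very small interval type. The following is only a sketch: the class `Interval` and the function `sin_enclosure` are hypothetical names, the arithmetic uses ordinary floating point with no outward rounding (so the results are not mathematically rigorous, unlike a true interval arithmetic package), and the remainder factor is bounded crudely by $M \in [-1,1]$, which is valid because every derivative of $\sin$ is bounded by 1 in magnitude.

```python
import math

class Interval:
    """Closed interval [lo, hi]; plain floats, NO outward rounding (illustrative only)."""
    def __init__(self, lo, hi=None):
        self.lo, self.hi = lo, (lo if hi is None else hi)

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def odd_power(self, k):
        # t -> t**k is increasing for odd k, so endpoints map to endpoints
        return Interval(self.lo ** k, self.hi ** k)

    def scale(self, s):
        # multiply by a real scalar s
        lo, hi = self.lo * s, self.hi * s
        return Interval(min(lo, hi), max(lo, hi))

    def contains(self, t):
        return self.lo <= t <= self.hi

def sin_enclosure(x):
    """Enclose sin over the interval x using K(x) = x^7/5040 and M in [-1, 1]."""
    poly = x - x.odd_power(3).scale(1 / 6) + x.odd_power(5).scale(1 / 120)
    remainder = x.odd_power(7).scale(1 / 5040) * Interval(-1.0, 1.0)
    return poly + remainder

point = sin_enclosure(Interval(0.05))          # K evaluated at a single point
wide = sin_enclosure(Interval(-0.01, 0.05))    # interval substituted into the polynomial
print(point.lo, point.hi)
print(wide.lo, wide.hi)
```

The enclosure `wide` is noticeably wider than the range of $\sin$ on $[-0.01, 0.05]$, which illustrates the loss of sharpness that comes from substituting an interval into the polynomial itself.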


Moduli of continuity, as well as Lipschitz constants, can be easily estimated
when $f \in C^1[a,b]$. In particular, in such cases,

\[ \omega(f; [a,b]; \delta) \le |f'([a,b])|\,\delta, \tag{4.27} \]

where $f'([a,b])$ is an interval evaluation of the derivative $f'$ over $[a,b]$ (or any
other set of bounds on the range of $f'$ over $[a,b]$), and $|\cdot|$ denotes the largest
absolute value of an element of the interval.

We now return to the concept, analogous to composite integration, of di- 
viding the interval of approximation into subintervals, and using a different 
polynomial over each subinterval. 


4.4 Piecewise Polynomial Approximation 


Piecewise polynomials are commonly used approximations. They are easy 
to work with, they can provide good approximations, and they are widely used 
in computer graphics. In addition, piecewise polynomials are employed, for 
example, in finite element methods. Good references for piecewise polynomial 
approximation are [52] and [80]. 

A simple type of piecewise polynomial is approximation by a line segment 
in each subinterval. 


4.4.1 Piecewise Linear Interpolation 


DEFINITION 4.12 Given a partition

\[ \Delta : a = x_0 < x_1 < \cdots < x_{N-1} < x_N = b \]

of $[a,b]$, the set $L_\Delta$ of all continuous piecewise linear polynomials on $[a,b]$
with respect to $\Delta$ is

\[ L_\Delta = \{ \varphi(x) \in C[a,b] : \varphi(x) \text{ is linear on each } [x_i, x_{i+1}],\ 0 \le i \le N-1, \text{ of } \Delta \}. \]

Graphically, $\varphi \in L_\Delta$ may appear, for example, as in Figure 4.7.


FIGURE 4.7: An example of a piecewise linear function. 


PROPOSITION 4.5
$L_\Delta$ is an $(N+1)$-dimensional subspace of $C[a,b]$.

PROOF $L_\Delta$ is a subspace of $C[a,b]$, since $L_\Delta$ is closed under addition
and scalar multiplication and $0 \in L_\Delta$. To complete the proof, we find a basis
of $N+1$ functions. This basis for $L_\Delta$ consists of the well-known hat functions
$\varphi_i(x)$, $0 \le i \le N$, defined by

\[ \varphi_0(x) = \begin{cases} \dfrac{x_1 - x}{x_1 - x_0}, & x_0 \le x \le x_1, \\ 0, & \text{otherwise}, \end{cases} \qquad \varphi_N(x) = \begin{cases} \dfrac{x - x_{N-1}}{x_N - x_{N-1}}, & x_{N-1} \le x \le x_N, \\ 0, & \text{otherwise}, \end{cases} \]

\[ \varphi_i(x) = \begin{cases} \dfrac{x - x_{i-1}}{x_i - x_{i-1}}, & x_{i-1} \le x \le x_i, \\ \dfrac{x_{i+1} - x}{x_{i+1} - x_i}, & x_i \le x \le x_{i+1}, \\ 0, & \text{otherwise}, \end{cases} \qquad \text{for } 1 \le i \le N-1. \]

These hat functions are depicted graphically in Figure 4.8. Notice that
$\varphi_i \in L_\Delta$ for $i = 0, 1, 2, \ldots, N$, and

\[ \varphi_i(x_j) = \delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases} \]

FIGURE 4.8: Graphs of the "hat" functions $\varphi_i(x)$.

We will show that $\{\varphi_i\}_{i=0}^{N}$ spans $L_\Delta$, that $\{\varphi_i\}_{i=0}^{N}$ is a linearly independent
set, and thus that $\{\varphi_i\}_{i=0}^{N}$ forms a basis for $L_\Delta$.

Linear Independence: Let $\sum_{j=0}^{N} c_j \varphi_j(x) = 0$ for every $x \in [a,b]$. Setting $x = x_i$,
$0 \le i \le N$, we conclude that $c_i = 0$, $0 \le i \le N$. Thus, $\{\varphi_i\}_{i=0}^{N}$ is a linearly
independent set. (Recall that $\{\varphi_i(x)\}_{i=0}^{N}$ is a linearly dependent set on an
interval if and only if there are constants $k_0, k_1, \ldots, k_N$, not all zero, such
that $\sum_{i=0}^{N} k_i \varphi_i(x) = 0$ for all $x$ on the interval.)

Spans $L_\Delta$: Let $f(x) \in L_\Delta$ and let $f_i = f(x_i)$, $0 \le i \le N$. Then $f(x) =
\sum_{j=0}^{N} f_j \varphi_j(x)$, since the right side coincides with $f(x)$ at $x = x_i$, $0 \le i \le N$,
and is linear in each subinterval $[x_i, x_{i+1}]$. (This also shows that $\{\varphi_i\}_{i=0}^{N}$ is a
collocating basis.)


PROPOSITION 4.6
Given $f \in C[a,b]$, there is a unique $\Phi \in L_\Delta$ which satisfies $f(x_i) = \Phi(x_i)$, $0 \le i \le N$.


PROOF Define $\Phi(x) = \sum_{i=0}^{N} f(x_i)\varphi_i(x)$, where the $\varphi_i(x)$ are the basis functions defined in the proof of Proposition 4.5. Clearly, $\Phi \in L_\Delta$ and $\Phi(x_i) = f(x_i)$, $0 \le i \le N$. Moreover, if two different such $\Phi$'s existed, say $\Phi_1$ and $\Phi_2$ with $\Phi_1(x_i) = \Phi_2(x_i) = f(x_i)$, $0 \le i \le N$, then the piecewise linear function $\Phi_1 - \Phi_2$, being zero at $x_i$, $0 \le i \le N$, would have to be zero everywhere on $[a,b]$ (since the linear interpolant at $x_i$ and $x_{i+1}$ is unique).


DEFINITION 4.13 We call $\Phi$, defined in Proposition 4.6, the $L_\Delta$-interpolant of $f$, and denote it by $\Phi(x) = I_N f(x)$.


REMARK 4.30 $I_N f(x) = \sum_{j=0}^{N} f(x_j)\varphi_j(x)$ is easily obtained.


REMARK 4.31 $I_N : C[a,b] \to L_\Delta$ is a linear operator, i.e.,
$$I_N\big(a f_1(x) + b f_2(x)\big) = a\, I_N f_1(x) + b\, I_N f_2(x).$$


We now wish to estimate the error in piecewise linear interpolation, i.e.,
$$\|f - I_N f\|_\infty = \max_{x \in [a,b]} |f(x) - I_N f(x)|.$$


Notation 4.1 Let $h = \max_{0 \le i \le N-1} (x_{i+1} - x_i)$ and $D^2 f(x) = f''(x)$.


We have 


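THEOREM 4.13
Let $f \in C^2[a,b]$. Then $\|f - I_N f\|_\infty \le \frac{1}{8} h^2 \|D^2 f\|_\infty$.


PROOF
$$\|f - I_N f\|_\infty = \max_{x \in [a,b]} |f(x) - I_N f(x)| = \max_{0 \le i \le N-1}\ \max_{x_i \le x \le x_{i+1}} \left| f(x) - \left( f(x_i)\frac{x_{i+1} - x}{h_i} + f(x_{i+1})\frac{x - x_i}{h_i} \right) \right|, \tag{4.28}$$
where $h_i = x_{i+1} - x_i$. By Taylor's Theorem,
$$f(x_i) = f(x) + f'(x)(x_i - x) + \int_x^{x_i} (x_i - t) f''(t)\,dt,$$
$$f(x_{i+1}) = f(x) + f'(x)(x_{i+1} - x) + \int_x^{x_{i+1}} (x_{i+1} - t) f''(t)\,dt.$$
Substituting these expressions into (4.28), the terms in $f(x)$ and $f'(x)$ cancel, which gives
$$\|f - I_N f\|_\infty = \max_{0 \le i \le N-1}\ \max_{x_i \le x \le x_{i+1}} \left| \frac{x_{i+1} - x}{h_i}\int_x^{x_i} (x_i - t) f''(t)\,dt + \frac{x - x_i}{h_i}\int_x^{x_{i+1}} (x_{i+1} - t) f''(t)\,dt \right|.$$
Thus, by the triangle inequality and taking $\|D^2 f\|_\infty$ outside the integrals and the maximum,
$$\begin{aligned}
\|f - I_N f\|_\infty
&\le \|D^2 f\|_\infty \max_{i}\ \max_{x_i \le x \le x_{i+1}} \left[ \frac{x_{i+1} - x}{h_i}\int_{x_i}^{x} (t - x_i)\,dt + \frac{x - x_i}{h_i}\int_{x}^{x_{i+1}} (x_{i+1} - t)\,dt \right] \\
&= \|D^2 f\|_\infty \max_{i}\ \max_{x_i \le x \le x_{i+1}} \left[ \frac{x_{i+1} - x}{h_i}\,\frac{(x - x_i)^2}{2} + \frac{x - x_i}{h_i}\,\frac{(x_{i+1} - x)^2}{2} \right] \\
&= \|D^2 f\|_\infty \max_{i}\ \max_{x_i \le x \le x_{i+1}} \left[ \frac{(x_{i+1} - x)(x - x_i)}{2} \right] \\
&= \|D^2 f\|_\infty \max_{0 \le i \le N-1} \frac{1}{8}(x_{i+1} - x_i)^2 = \frac{h^2}{8}\|D^2 f\|_\infty.
\end{aligned}$$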


REMARK 4.32 For $f \in C^2[a,b]$, it can also be shown that
$$\|D(f - I_N f)\|_\infty \le \tfrac{1}{2} h \|D^2 f\|_\infty.$$
(See [80].)

REMARK 4.33 For $f \in C^1[a,b]$, it can be shown that
$$\|f - I_N f\|_\infty \le \tfrac{1}{2} h \|D f\|_\infty.$$


REMARK 4.34 For $f \in C[a,b]$, it is straightforward to show that $\|f - I_N f\|_\infty \to 0$ as $h \to 0$.


Example 4.11 

Consider $f(x) = \ln x$ on the interval $[2,4]$. We want to find $h$ that will guarantee that the piecewise linear interpolant of $f(x)$ on $[2,4]$ has an error of at most $10^{-4}$. We will assume that $|x_{i+1} - x_i| = h$ for $0 \le i \le N-1$. Then
$$\|f - I_N f\|_\infty \le \frac{h^2}{8}\|f''\|_\infty = \frac{h^2}{8}\max_{2 \le x \le 4}\left| -\frac{1}{x^2} \right| = \frac{h^2}{32} \le 10^{-4}.$$
Thus $h^2 \le 32 \times 10^{-4}$, $h \le 0.056$, giving $N = (4-2)/h \ge 36$.
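The bound in Example 4.11 can be checked numerically. The following is a minimal sketch, assuming a uniform mesh; the helper names `piecewise_linear` and `max_error` are illustrative and not from the text, and the uniform norm is only estimated by sampling on a fine grid.

```python
import math

def piecewise_linear(f, a, b, n):
    """Return the piecewise linear interpolant of f on a uniform mesh
    of n subintervals of [a, b] (hypothetical helper)."""
    h = (b - a) / n
    nodes = [a + i * h for i in range(n + 1)]
    values = [f(x) for x in nodes]

    def interpolant(x):
        # Locate the subinterval [x_i, x_{i+1}] containing x.
        i = min(int((x - a) / h), n - 1)
        t = (x - nodes[i]) / h
        return (1.0 - t) * values[i] + t * values[i + 1]

    return interpolant

def max_error(f, g, a, b, samples=20001):
    """Estimate max |f - g| on [a, b] by sampling (not an exact norm)."""
    points = (a + (b - a) * k / (samples - 1) for k in range(samples))
    return max(abs(f(x) - g(x)) for x in points)

n = 36
h = 2.0 / n
bound = h * h / 8.0 * 0.25      # (h^2 / 8) * max |f''| on [2, 4], with f = ln
err = max_error(math.log, piecewise_linear(math.log, 2.0, 4.0, n), 2.0, 4.0)
print("sampled error:", err, "theoretical bound:", bound)
```

The sampled error should not exceed the bound of Theorem 4.13, and with $N = 36$ the bound itself is below $10^{-4}$.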
Although hat functions and piecewise linear functions are frequently used in practice, it is desirable in some applications, such as computer graphics, for the interpolant to be smoother (say $C^1$, $C^2$, or even higher) at the mesh points $x_i$ of $\Delta$. Special piecewise cubic polynomials, which we consider next, are commonly used for this purpose.


4.4.2 Cubic Spline Interpolation 


DEFINITION 4.14 The set of cubic splines on $[a,b]$ with respect to the partition $\Delta$ is defined as
$$S_\Delta = \{\varphi(x) \in C^2[a,b] : \varphi(x) \text{ is a cubic polynomial on each subinterval } [x_j, x_{j+1}],\ 0 \le j \le N-1, \text{ of } \Delta\}.$$


REMARK 4.35 $\varphi \in S_\Delta$ has $C^2$ smoothness, i.e., $\varphi \in S_\Delta$ has two continuous derivatives at each point, including the mesh points.


We first develop a basis for $S_\Delta$. For convenience, we assume here a uniform mesh, i.e., $x_{j+1} - x_j = h$ for all $j$, and let
$$s_j(x) = \begin{cases}
0, & x > x_{j+2}, \\[0.5ex]
\dfrac{1}{6h^3}(x_{j+2} - x)^3, & x_{j+1} \le x \le x_{j+2}, \\[1ex]
\dfrac{2}{3} - \dfrac{1}{h^2}(x - x_j)^2 + \dfrac{1}{2h^3}(x - x_j)^3, & x_j \le x \le x_{j+1}, \\[1ex]
\dfrac{2}{3} - \dfrac{1}{h^2}(x - x_j)^2 - \dfrac{1}{2h^3}(x - x_j)^3, & x_{j-1} \le x \le x_j, \\[1ex]
\dfrac{1}{6h^3}(x - x_{j-2})^3, & x_{j-2} \le x \le x_{j-1}, \\[0.5ex]
0, & x < x_{j-2}.
\end{cases} \tag{4.29}$$


DEFINITION 4.15 The function $s_j(x)$ defined by (4.29) is called a B-spline centered at $x = x_j$ with respect to the partition $\Delta$ with a uniform mesh.


REMARK 4.36 It is straightforward to show that $s_j'(x)$ and $s_j''(x)$ are continuous, so $s_j(x) \in S_\Delta$ (Exercise 23 on page 287).
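As a small illustration of (4.29), the sketch below evaluates the B-spline in the scaled variable $u = |x - x_j|/h$, which is an equivalent way of writing the four nonzero pieces. The function name `bspline` is an assumption made for this example.

```python
def bspline(x, xj, h):
    """Cubic B-spline s_j(x) of (4.29), centered at xj on a uniform mesh
    of width h. Written in terms of u = |x - xj| / h (illustrative helper)."""
    u = abs(x - xj) / h
    if u >= 2.0:
        return 0.0                              # outside [x_{j-2}, x_{j+2}]
    if u >= 1.0:
        return (2.0 - u) ** 3 / 6.0             # the two outer cubic pieces
    return 2.0 / 3.0 - u * u + 0.5 * u ** 3     # the two inner cubic pieces

h = 0.5
# Values at the mesh points: s_j(x_j) = 2/3, s_j(x_{j +/- 1}) = 1/6, s_j(x_{j +/- 2}) = 0.
print(bspline(0.0, 0.0, h), bspline(h, 0.0, h), bspline(2 * h, 0.0, h))
# With this normalization, neighboring B-splines sum to 1 at any point.
total = sum(bspline(0.3, j * h, h) for j in range(-3, 4))
print(total)
```

The node values $2/3$ and $1/6$ are the origin of the entries $4, 1$ (after multiplying by 6) in the matrices used later for spline interpolation.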
We now introduce two extra points $x_{-1} = x_0 - h$ and $x_{N+1} = x_N + h$ and also consider the B-splines $s_{-1}(x)$ and $s_{N+1}(x)$ centered at $x_{-1}$ and $x_{N+1}$. Now let
$$\varphi_j(x) = \begin{cases} 0, & x < a, \\ s_j(x), & a \le x \le b, \\ 0, & x > b, \end{cases} \qquad \text{for } -1 \le j \le N+1.$$


The $\varphi_j$'s are depicted graphically in Figure 4.9.


FIGURE 4.9: B-spline basis functions. 


THEOREM 4.14
The functions $\{\varphi_j(x)\}_{j=-1}^{N+1}$ form a basis for $S_\Delta$.


PROOF First we show linear independence, then we show that $\{\varphi_j(x)\}_{j=-1}^{N+1}$ spans $S_\Delta$.


Proof of Linear Independence: Let $\sum_{j=-1}^{N+1} c_j \varphi_j(x) = 0$ for all $x$. In particular,
$$\sum_{j=-1}^{N+1} c_j \varphi_j(x_i) = 0,\ 0 \le i \le N, \qquad \text{and} \qquad \sum_{j=-1}^{N+1} c_j \varphi_j'(x_k) = 0 \ \text{ for } k = 0 \text{ and } k = N.$$
Using the definition (4.29) of the B-spline $s_j$, we see for $c = [c_0, c_1, \ldots, c_N]^T$ that $Ac = 0$, $c_{-1} = c_1$, and $c_{N+1} = c_{N-1}$, where
$$A = \begin{pmatrix}
4 & 2 & 0 & \cdots & & 0 \\
1 & 4 & 1 & & & \\
0 & 1 & 4 & 1 & & \\
 & & \ddots & \ddots & \ddots & 0 \\
 & & & 1 & 4 & 1 \\
0 & & \cdots & 0 & 2 & 4
\end{pmatrix}. \tag{4.30}$$
Since $|a_{ii}| > \sum_{j \ne i} |a_{ij}|$ for $i = 0, 1, 2, \ldots, N$, we see that $A$ is strictly diagonally dominant and hence nonsingular. Thus, $c = 0$, that is, $c_{-1} = c_0 = c_1 = \cdots = c_{N+1} = 0$. Therefore, $\{\varphi_j\}_{j=-1}^{N+1}$ is a linearly independent set.


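Proof that $\{\varphi_j(x)\}_{j=-1}^{N+1}$ spans $S_\Delta$: Given $\Phi \in S_\Delta$ we need to show that
$$\Phi(x) = \sum_{j=-1}^{N+1} c_j \varphi_j(x).$$
Consider the interval $[x_j, x_{j+1}]$. The set $P_3(x_j, x_{j+1})$ of polynomials of degree at most 3 on $[x_j, x_{j+1}]$ is 4-dimensional. Since $\varphi_{j-1}, \varphi_j, \varphi_{j+1}, \varphi_{j+2} \in P_3(x_j, x_{j+1})$ are linearly independent, they span $P_3(x_j, x_{j+1})$ and thus form a basis for $P_3(x_j, x_{j+1})$. Now let $\Phi \in S_\Delta$, so $\Phi$ restricted to $[x_j, x_{j+1}]$ is in $P_3(x_j, x_{j+1})$. Thus,
$$\Phi(x) = c_{j-1}^{(j)}\varphi_{j-1}(x) + c_{j}^{(j)}\varphi_{j}(x) + c_{j+1}^{(j)}\varphi_{j+1}(x) + c_{j+2}^{(j)}\varphi_{j+2}(x) \quad \text{for } x \in [x_j, x_{j+1}], \tag{4.31}$$
where the coefficients $c_{j+k}^{(j)}$, $k = -1, 0, 1, 2$, may depend on the interval $j$. If the coefficients can be shown to be independent of $j$, then every $\Phi \in S_\Delta$ can be written as a linear combination of $\varphi_j(x)$, $-1 \le j \le N+1$, for $x \in [a,b]$. To show this, consider $\Phi(x)$ on $[x_{j+1}, x_{j+2}]$:
$$\Phi(x) = c_{j}^{(j+1)}\varphi_{j}(x) + c_{j+1}^{(j+1)}\varphi_{j+1}(x) + c_{j+2}^{(j+1)}\varphi_{j+2}(x) + c_{j+3}^{(j+1)}\varphi_{j+3}(x) \quad \text{for } x \in [x_{j+1}, x_{j+2}]. \tag{4.32}$$
If we show that $c_j^{(j)} = c_j^{(j+1)}$, $c_{j+1}^{(j)} = c_{j+1}^{(j+1)}$, and $c_{j+2}^{(j)} = c_{j+2}^{(j+1)}$, then the $c_j$'s are independent of the interval. Equations (4.31) and (4.32) and $C^2$-continuity of $\Phi$ at $x = x_{j+1}$ give

$\Phi(x_{j+1}^-) = \Phi(x_{j+1}^+)$:
$$\tfrac{1}{6}c_j^{(j)} + \tfrac{2}{3}c_{j+1}^{(j)} + \tfrac{1}{6}c_{j+2}^{(j)} = \tfrac{1}{6}c_j^{(j+1)} + \tfrac{2}{3}c_{j+1}^{(j+1)} + \tfrac{1}{6}c_{j+2}^{(j+1)},$$
$\Phi'(x_{j+1}^-) = \Phi'(x_{j+1}^+)$:
$$-\tfrac{1}{2h}c_j^{(j)} + \tfrac{1}{2h}c_{j+2}^{(j)} = -\tfrac{1}{2h}c_j^{(j+1)} + \tfrac{1}{2h}c_{j+2}^{(j+1)},$$
$\Phi''(x_{j+1}^-) = \Phi''(x_{j+1}^+)$:
$$\tfrac{1}{h^2}c_j^{(j)} - \tfrac{2}{h^2}c_{j+1}^{(j)} + \tfrac{1}{h^2}c_{j+2}^{(j)} = \tfrac{1}{h^2}c_j^{(j+1)} - \tfrac{2}{h^2}c_{j+1}^{(j+1)} + \tfrac{1}{h^2}c_{j+2}^{(j+1)}.$$
Thus,
$$\begin{pmatrix} 1 & 4 & 1 \\ -1 & 0 & 1 \\ 1 & -2 & 1 \end{pmatrix}
\begin{pmatrix} c_j^{(j)} - c_j^{(j+1)} \\ c_{j+1}^{(j)} - c_{j+1}^{(j+1)} \\ c_{j+2}^{(j)} - c_{j+2}^{(j+1)} \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},$$
and, since this matrix is nonsingular (its determinant is 12), $c_j^{(j)} = c_j^{(j+1)}$, $c_{j+1}^{(j)} = c_{j+1}^{(j+1)}$, and $c_{j+2}^{(j)} = c_{j+2}^{(j+1)}$, which gives $c_{j+k}^{(j)} = c_{j+k}$, independent of the interval.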


4.4.3. Cubic Spline Interpolants 


Two types of cubic spline interpolant are commonly used: clamped bound- 
ary cubic spline interpolants and natural cubic spline interpolants. 


DEFINITION 4.16 The clamped boundary spline interpolant $\Phi_c \in S_\Delta$ of a function $f \in C^1[a,b]$ satisfies
$$\text{(c)} \quad \begin{cases} \Phi_c(x_i) = f(x_i), & i = 0, 1, \ldots, N, \\ \Phi_c'(x_0) = f'(x_0), \\ \Phi_c'(x_N) = f'(x_N). \end{cases}$$


DEFINITION 4.17 The natural spline interpolant $\Phi_n \in S_\Delta$ of a function $f \in C[a,b]$ satisfies
$$\text{(n)} \quad \begin{cases} \Phi_n(x_i) = f(x_i), & i = 0, 1, \ldots, N, \\ \Phi_n''(x_0) = 0, \\ \Phi_n''(x_N) = 0. \end{cases}$$


PROPOSITION 4.7
Let $f \in C^1[a,b]$. Then there is a unique clamped boundary interpolant and a unique natural spline interpolant of $f$. (We are assuming here a uniform mesh.)


PROOF The proof is constructive, i.e., it describes how the interpolants $\Phi_c$ and $\Phi_n$ can be obtained.


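Let
$$\Phi_c(x) = \sum_{j=-1}^{N+1} q_j \varphi_j(x).$$
The requirements (c) then lead to the system
$$\sum_{j=-1}^{N+1} \varphi_j(x_i) q_j = f(x_i), \quad i = 0, 1, \ldots, N,$$
$$q_{-1}\varphi_{-1}'(x_0) + q_1\varphi_1'(x_0) = f'(x_0),$$
$$q_{N-1}\varphi_{N-1}'(x_N) + q_{N+1}\varphi_{N+1}'(x_N) = f'(x_N),$$
since $\varphi_0'(x_0) = \varphi_N'(x_N) = 0$. The above system can be written in matrix form as
$$\begin{pmatrix}
4 & 2 & & & \\
1 & 4 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & 4 & 1 \\
 & & & 2 & 4
\end{pmatrix}
\begin{pmatrix} q_0 \\ q_1 \\ \vdots \\ q_{N-1} \\ q_N \end{pmatrix}
=
\begin{pmatrix} 6f(x_0) + 2hf'(x_0) \\ 6f(x_1) \\ \vdots \\ 6f(x_{N-1}) \\ 6f(x_N) - 2hf'(x_N) \end{pmatrix}, \tag{4.33}$$
where
$$q_{-1} = q_1 - 2hf'(x_0), \quad \text{and} \quad q_{N+1} = q_{N-1} + 2hf'(x_N).$$
The system (4.33) has a unique solution $\{q_j\}_{j=-1}^{N+1}$ because the matrix $A$ is strictly diagonally dominant.


Now consider
$$\Phi_n(x) = \sum_{j=-1}^{N+1} s_j \varphi_j(x).$$
Conditions (n) lead to (exercise) the system
$$\begin{pmatrix}
6 & 0 & 0 & & \\
1 & 4 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & 4 & 1 \\
 & & 0 & 0 & 6
\end{pmatrix}
\begin{pmatrix} s_0 \\ s_1 \\ \vdots \\ s_{N-1} \\ s_N \end{pmatrix}
= 6
\begin{pmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_{N-1}) \\ f(x_N) \end{pmatrix}, \tag{4.34}$$
where
$$s_{-1} = -s_1 + 2s_0, \qquad s_{N+1} = -s_{N-1} + 2s_N.$$
Clearly, the $s_j$'s are uniquely determined.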


We give the following error estimate without proof. 


THEOREM 4.15
Let $f \in C^4[a,b]$ and let $\Phi_c(x)$ be the clamped boundary cubic interpolant (uniform mesh). Then
$$\|f - \Phi_c\|_\infty \le \frac{5}{384} h^4 \|D^4 f\|_\infty.$$
Furthermore,
$$\|D^r(f - \Phi_c)\|_\infty \le c_r h^{4-r} \|D^4 f\|_\infty, \quad 0 \le r \le 3,$$
where $c_0 = 5/384$, $c_1 = 1/24$, $c_2 = 3/8$, and $c_3 = 1$.

PROOF See [80].


REMARK 4.37 A similar result holds for natural boundary cubic interpolants. See [14].


REMARK 4.38 Similar results hold for a nonuniform mesh.


Example 4.12 
Consider $f(x) = \ln x$. We wish to determine how small $h$ should be to ensure that the cubic spline interpolant $\Phi_c(x)$ of $f(x)$ on the interval $[2,4]$ has error less than $10^{-4}$. We have
$$\|f - \Phi_c\|_\infty \le \frac{5}{384} h^4 \|D^4 f\|_\infty = \frac{5}{384} h^4 \max_{2 \le x \le 4} \left| \frac{6}{x^4} \right| = \left(\frac{5}{384}\right)\left(\frac{6}{16}\right) h^4 \le 10^{-4}.$$
Thus, $h^4 \le (1/30)(384)(16)10^{-4}$, $h \le 0.38$, and $N \ge 2/0.38$, i.e., $N \ge 6$. (Recall that we required $N \ge 36$ to achieve the same error with piecewise linear interpolants.)
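The construction in the proof of Proposition 4.7 and the estimate of Example 4.12 can be tried out together. The sketch below is one possible implementation, assuming a uniform mesh: it solves the tridiagonal system (4.33) with a standard elimination routine and evaluates the B-spline expansion. The names `bspline`, `solve_tridiagonal`, and `clamped_spline` are illustrative and not from the text, and the error is only sampled on a fine grid.

```python
import math

def bspline(x, xj, h):
    """Cubic B-spline of (4.29) centered at xj, uniform mesh width h."""
    u = abs(x - xj) / h
    if u >= 2.0:
        return 0.0
    if u >= 1.0:
        return (2.0 - u) ** 3 / 6.0
    return 2.0 / 3.0 - u * u + 0.5 * u ** 3

def solve_tridiagonal(lower, diag, upper, rhs):
    """Gaussian elimination for a tridiagonal system (no pivoting, which is
    safe here because the matrix is strictly diagonally dominant).
    lower[0] and upper[-1] are unused."""
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        r[i] -= m * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

def clamped_spline(f, df, a, b, n):
    """Clamped cubic spline interpolant built from system (4.33)."""
    h = (b - a) / n
    nodes = [a + i * h for i in range(n + 1)]
    lower = [0.0] + [1.0] * (n - 1) + [2.0]
    diag = [4.0] * (n + 1)
    upper = [2.0] + [1.0] * (n - 1) + [0.0]
    rhs = [6.0 * f(x) for x in nodes]
    rhs[0] += 2.0 * h * df(a)
    rhs[n] -= 2.0 * h * df(b)
    q = solve_tridiagonal(lower, diag, upper, rhs)
    # Coefficients q_{-1} and q_{N+1} follow from the derivative conditions.
    coeffs = [q[1] - 2.0 * h * df(a)] + q + [q[n - 1] + 2.0 * h * df(b)]
    centers = [a + j * h for j in range(-1, n + 2)]
    return lambda x: sum(c * bspline(x, xj, h) for c, xj in zip(coeffs, centers))

n = 6
h = 2.0 / n
spline = clamped_spline(math.log, lambda x: 1.0 / x, 2.0, 4.0, n)
bound = 5.0 / 384.0 * h ** 4 * 6.0 / 16.0      # (5/384) h^4 max |f''''| on [2, 4]
grid = [2.0 + 2.0 * k / 4000 for k in range(4001)]
err = max(abs(math.log(x) - spline(x)) for x in grid)
print("sampled error:", err, "bound from Theorem 4.15:", bound)
```

With $N = 6$ the bound is below $10^{-4}$, consistent with the example, and the sampled error should lie below the bound.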


REMARK 4.39 Satisfaction of $\Phi_c'(x_0) = f'(x_0)$, $\Phi_c'(x_N) = f'(x_N)$ may be difficult to achieve if $f'(x)$ is not explicitly known. Approximations of order $h^4$ can then be used. Examples of such approximations are:
$$f'(x_0) = \frac{1}{12h}\Big[ -25 f(x_0) + 48 f(x_0 + h) - 36 f(x_0 + 2h) + 16 f(x_0 + 3h) - 3 f(x_0 + 4h) \Big] + (\text{error}),$$
where $(\text{error}) = \dfrac{h^4}{5} f^{(5)}(\xi)$, $x_0 < \xi < x_0 + 4h$, and
$$f'(x_N) = \frac{1}{12h}\Big[ 25 f(x_N) - 48 f(x_N - h) + 36 f(x_N - 2h) - 16 f(x_N - 3h) + 3 f(x_N - 4h) \Big] + (\text{error}),$$
where $(\text{error}) = \dfrac{h^4}{5} f^{(5)}(\xi)$, $x_N - 4h < \xi < x_N$.
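A quick numerical check of the two one-sided formulas in Remark 4.39 is sketched below for $f(x) = \ln x$ on $[2,4]$ with $h = 1/3$. The function names are illustrative; the printed bounds use the error term $h^4 f^{(5)}(\xi)/5$ with $f^{(5)}(x) = 24/x^5$.

```python
import math

def forward_derivative(f, x0, h):
    """Five-point forward approximation to f'(x0) of order h^4."""
    return (-25.0 * f(x0) + 48.0 * f(x0 + h) - 36.0 * f(x0 + 2 * h)
            + 16.0 * f(x0 + 3 * h) - 3.0 * f(x0 + 4 * h)) / (12.0 * h)

def backward_derivative(f, xn, h):
    """Five-point backward approximation to f'(xn) of order h^4."""
    return (25.0 * f(xn) - 48.0 * f(xn - h) + 36.0 * f(xn - 2 * h)
            - 16.0 * f(xn - 3 * h) + 3.0 * f(xn - 4 * h)) / (12.0 * h)

h = 1.0 / 3.0
left = forward_derivative(math.log, 2.0, h)     # exact value is 1/2
right = backward_derivative(math.log, 4.0, h)   # exact value is 1/4
# Bounds on the error term: |f^(5)| is largest at the left end of each stencil.
left_bound = h ** 4 / 5.0 * 24.0 / 2.0 ** 5
right_bound = h ** 4 / 5.0 * 24.0 / (4.0 - 4.0 * h) ** 5
print(abs(left - 0.5), left_bound)
print(abs(right - 0.25), right_bound)
```

These approximations can replace $f'(x_0)$ and $f'(x_N)$ in the right-hand side of (4.33) when only function values are available.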


REMARK 4.40 It can be shown that if $u$ is any $C^2$-function on $[a,b]$ such that $u$ interpolates $f$ in the manner
$$u(x_i) = f(x_i),\ 0 \le i \le N, \qquad u'(x_0) = f'(x_0), \qquad u'(x_N) = f'(x_N),$$
then
$$\int_a^b \big(\Phi_c''(x)\big)^2\,dx \le \int_a^b \big(u''(x)\big)^2\,dx.$$
That is, among all clamped $C^2$-interpolants of $f$, the clamped spline interpolant is the smoothest in the sense of minimizing $\int_a^b (u''(x))^2\,dx$.


We now turn to a type of approximation that has played a large role in clas- 
sical applied mathematics, and is also important in modern signal processing 
technology, such as in storage and transmission of audio and video signals. 


4.5 Trigonometric Approximation 


Several good references on least squares trigonometric approximation and 
trigonometric interpolation are [31], [35], and [49]. 


4.5.1 Least Squares Trigonometric Approximation (Fourier Series)


Let $V = C[0, 2\pi]$ and
$$W_{2k} = \operatorname{span}(w_{-k}, w_{-k+1}, \ldots, w_0, w_1, \ldots, w_k),$$
where $w_j(x) = e^{ijx}$ and $i = \sqrt{-1}$. Let
$$(f, g) = \int_0^{2\pi} f(x)\overline{g(x)}\,dx, \quad \text{for } f, g \in C[0, 2\pi].$$
Then, $W_{2k}$ is a subspace of the inner product space $V$. It is straightforward to show that $(w_j, w_\ell) = 2\pi\delta_{j\ell}$. Thus, $\{w_{-k}, w_{-k+1}, \ldots, w_0, w_1, \ldots, w_k\}$ is an orthogonal basis for $W_{2k}$, and the best (least squares) approximation $g_k \in W_{2k}$ to $f \in V$ has the form
$$g_k(x) = \sum_{j=-k}^{k} \alpha_j e^{ijx}, \quad \text{where } \alpha_j = \frac{1}{2\pi}\int_0^{2\pi} f(x) e^{-ijx}\,dx. \tag{4.35}$$
j=- 
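DEFINITION 4.18
$$g_\infty(x) = \sum_{j=-\infty}^{\infty} \alpha_j e^{ijx}, \quad \text{where } \alpha_j = \frac{1}{2\pi}\int_0^{2\pi} f(x) e^{-ijx}\,dx,$$
is the complex Fourier series of $f(x)$, defined for $0 \le x \le 2\pi$.


REMARK 4.41 The complex Fourier series can be readily transformed to the standard form of Fourier series. Consider
$$\begin{aligned}
g_k(x) = \sum_{j=-k}^{k} \alpha_j e^{ijx}
&= \alpha_0 + \sum_{j=1}^{k} \big( \alpha_j e^{ijx} + \alpha_{-j} e^{-ijx} \big) \\
&= \alpha_0 + \sum_{j=1}^{k} \big( (\alpha_j + \alpha_{-j}) \cos jx + i(\alpha_j - \alpha_{-j}) \sin jx \big) \\
&= a_0 + \sum_{j=1}^{k} \big( a_j \cos jx + b_j \sin jx \big),
\end{aligned}$$
where
$$a_0 = \alpha_0, \qquad a_j = \alpha_j + \alpha_{-j}, \quad \text{and} \quad b_j = (\alpha_j - \alpha_{-j})\,i \quad \text{for } j = 1, 2, \ldots, k.$$
In addition, if $a_0$, $\{a_j\}_{j=1}^{k}$, and $\{b_j\}_{j=1}^{k}$ are given, then
$$\alpha_0 = a_0, \qquad \alpha_j = \tfrac{1}{2}(a_j - i b_j), \quad \text{and} \quad \alpha_{-j} = \tfrac{1}{2}(a_j + i b_j) \quad \text{for } j = 1, 2, \ldots, k.$$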


We begin our study of trigonometric approximation with a well-known re- 
sult for Fourier series. To state this result, we first give 


DEFINITION 4.19 Let $a = x_0 < x_1 < \cdots < x_n = b$ be a subdivision of $[a,b]$. Define
$$t = \sum_{i=1}^{n} |f(x_i) - f(x_{i-1})|.$$
If $\sup t < \infty$, where the supremum is taken over all subdivisions of $[a,b]$, then $f$ is said to be of bounded total variation. Notice, for example, if $f \in C^1[a,b]$, then $f$ is of bounded total variation.

THEOREM 4.16
If $f(x)$ is piecewise continuous on $[0, 2\pi]$ and has bounded total variation, then
$$g_\infty(x) = \tfrac{1}{2}\big[f(x^+) + f(x^-)\big]$$
for $0 < x < 2\pi$, and $g_\infty(x)$ is repeated periodically outside the interval $[0, 2\pi]$. In particular,
$$g_\infty(0) = g_\infty(2\pi) = \tfrac{1}{2}\big[f(0^+) + f(2\pi^-)\big].$$


PROOF See [35].


REMARK 4.42 For the interval $[0, \pi]$ (rather than $[0, 2\pi]$ as above), the Fourier sine series of a function $f(x)$ defined for $0 \le x \le \pi$ is the function
$$g_S(x) = \sum_{\ell=1}^{\infty} a_\ell \sin \ell x, \quad \text{where } a_\ell = \frac{2}{\pi}\int_0^{\pi} f(x) \sin \ell x\,dx. \tag{4.36}$$
The Fourier cosine series of a function defined for $0 \le x \le \pi$ is
$$g_C(x) = \sum_{\ell=0}^{\infty} a_\ell \cos \ell x, \tag{4.37}$$
where
$$a_0 = \frac{1}{\pi}\int_0^{\pi} f(x)\,dx \quad \text{and} \quad a_\ell = \frac{2}{\pi}\int_0^{\pi} f(x) \cos \ell x\,dx \ \text{ for } \ell \ge 1.$$
By using the change of variables $\hat{x} = x - \pi$ in (4.35) and extending $f$ evenly or oddly to $[-\pi, 0]$, it follows from Theorem 4.16 that if $f(x)$ is piecewise continuous and of bounded variation, then
$$g_S(x) = g_C(x) = \tfrac{1}{2}\big[f(x^+) + f(x^-)\big]$$
for $0 < x < \pi$. Furthermore, $g_S(-x) = -g_S(x)$ and $g_C(-x) = g_C(x)$ for $-\pi < x < 0$, and $g_S(x)$ and $g_C(x)$ are extended periodically outside $x \in [-\pi, \pi]$.


We now have the following interesting result concerning the convergence of 
the least squares trigonometric approximation g,(x) to f(x). 


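THEOREM 4.17
Suppose that $f \in C^n[0, 2\pi]$, $n \ge 2$, and
$$f^{(p)}(0) = f^{(p)}(2\pi)$$
for $p = 0, 1, \ldots, n-1$. Then
$$\|g_k - f\|_\infty = O\big(1/k^{n-1}\big).$$
In particular, if $f(x)$ is infinitely differentiable and periodic, then $g_k(x)$ converges to $f(x)$ more rapidly than any finite power of $1/k$ as $k \to \infty$.


PROOF By (4.35), then using integration by parts, we obtain, for $j \ne 0$,
$$\alpha_j = \frac{1}{2\pi}\int_0^{2\pi} f(x) e^{-ijx}\,dx
= -\frac{1}{2\pi i j}\Big[ f(x) e^{-ijx} \Big]_0^{2\pi} + \frac{1}{2\pi i j}\int_0^{2\pi} f'(x) e^{-ijx}\,dx,$$
and the boundary term vanishes because $f(0) = f(2\pi)$. Thus,
$$\alpha_j = \frac{1}{2\pi (ij)^n}\int_0^{2\pi} f^{(n)}(x) e^{-ijx}\,dx$$
by repeated integration by parts. Therefore,
$$|\alpha_j| \le \frac{1}{2\pi |j|^n}\int_0^{2\pi} \big|f^{(n)}(x)\big|\,\big|e^{-ijx}\big|\,dx \le \max_{0 \le x \le 2\pi} \big|f^{(n)}(x)\big| \frac{1}{|j|^n} = \frac{c_n}{|j|^n} \quad \text{for } j = \pm 1, \pm 2, \ldots,$$
where $c_n = \max_{0 \le x \le 2\pi} |f^{(n)}(x)|$. By Theorem 4.16, $f(x) = g_\infty(x) = \sum_{j=-\infty}^{\infty} \alpha_j e^{ijx}$. Thus,
$$\|f - g_k\|_\infty = \max_{0 \le x \le 2\pi} \left| \sum_{j=-\infty}^{\infty} \alpha_j e^{ijx} - \sum_{j=-k}^{k} \alpha_j e^{ijx} \right|
= \max_{0 \le x \le 2\pi} \left| \sum_{|j| > k} \alpha_j e^{ijx} \right|
\le \sum_{|j| > k} |\alpha_j|
\le 2 c_n \sum_{j=k+1}^{\infty} \frac{1}{j^n}.$$
Consider
$$\begin{aligned}
\sum_{j=k+1}^{\infty} \frac{1}{j^n} = \sum_{j=1}^{\infty} \frac{1}{(k + j)^n}
&= \frac{1}{k^n}\sum_{j=1}^{\infty} \frac{1}{(j/k + 1)^n} \\
&= \frac{1}{k^n}\left[ \sum_{j=1}^{k} \frac{1}{(j/k + 1)^n} + \sum_{j=k+1}^{2k} \frac{1}{(j/k + 1)^n} + \sum_{j=2k+1}^{3k} \frac{1}{(j/k + 1)^n} + \cdots \right] \\
&\le \frac{1}{k^n}\left[ k + \frac{k}{2^n} + \frac{k}{3^n} + \cdots \right] = \frac{d_n}{k^{n-1}},
\end{aligned}$$
where $d_n = \sum_{j=1}^{\infty} \dfrac{1}{j^n}$, and $d_n$ is finite for $n \ge 2$. Hence,
$$\|g_k - f\|_\infty \le \frac{2 c_n d_n}{k^{n-1}} = O\!\left(\frac{1}{k^{n-1}}\right).$$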


Example 4.13 

The L?-errors, i.e., ||gx — f\l2 for trigonometric approximations to various 
functions on —1 < x < 1 appear in the following table, where f,; € C/—1, 1], 
fo € C7[-1,1], fg € C*[-1, 1], fa € C[—-1, 1], and fy is periodic. 


file) = fala) = 0! | fle) = | fl) = 
0.053 0.22 0.014 0.045 
. 0.00054 
0.0023 0.74 x1077 


0.00083 a 


For comparison, the calculated L?-errors for piecewise linear (PL) interpolants 
and cubic spline (CS) interpolants with M basis elements are given in the 
following table. 


000019 | 088 | .000071 | 0.10 
4.5.2 Trigonometric Interpolation on a Finite Point Set (the FFT)


Trigonometric interpolation is often useful for interpolation of large amounts of data when the data are given at equally spaced points. In the trigonometric interpolation problem, we have $N$ values of $f(x)$ on the interval $[0, 2\pi]$ at equally spaced points $x_j = 2\pi j/N$, $j = 0, 1, \ldots, N-1$, i.e., $(x_0, f(x_0))$, $(x_1, f(x_1))$, ..., $(x_{N-1}, f(x_{N-1}))$, where $f(0) = f(2\pi)$, i.e., $f$ is periodic with period $2\pi$. We wish to find the trigonometric polynomial $p(x) = \sum_{j=0}^{N-1} c_j e^{ijx}$ that interpolates these data, i.e., $p(x_k) = f(x_k)$ for $k = 0, 1, \ldots, N-1$.


FIGURE 4.10: The graph of a trigonometric polynomial p(x). 


First, consider $g(x) = \sum_{j=0}^{N-1} \alpha_j e^{ijx}$, the best approximation to $f(x) \in C[0, 2\pi]$ in the inner product space with $(v, u) = \int_0^{2\pi} v(x)\overline{u(x)}\,dx$. (We assume that $f(x)$ is known.) Then, by (4.35), we know that
$$\alpha_j = \frac{1}{2\pi}\int_0^{2\pi} f(x) e^{-ijx}\,dx. \tag{4.38}$$
Let's approximate the integral in (4.38) using the composite trapezoidal rule, i.e.,
$$\int_a^b f(x)\,dx \approx \frac{b - a}{N}\left[ \frac{1}{2}f(x_0) + \sum_{j=1}^{N-1} f(x_j) + \frac{1}{2}f(x_N) \right],$$
where $x_j = a + jh$, $h = (b - a)/N$. Then,
$$\begin{aligned}
\alpha_j &\approx \frac{1}{2\pi}\,\frac{2\pi}{N}\sum_{k=1}^{N-1} f(x_k) e^{-ijx_k} + \frac{1}{2\pi}\,\frac{2\pi}{N}\left[ \frac{1}{2} f(x_0) e^{-ijx_0} + \frac{1}{2} f(x_N) e^{-ijx_N} \right] \\
&= \frac{1}{N}\sum_{k=0}^{N-1} f(x_k) e^{-ijx_k},
\end{aligned} \tag{4.39}$$
assuming that $f(0) = f(x_0) = f(x_N) = f(2\pi)$, i.e., $f$ is periodic. We will see that
$$c_j = \frac{1}{N}\sum_{k=0}^{N-1} f(x_k) e^{-ijx_k}, \tag{4.40}$$
where $p(x) = \sum_{j=0}^{N-1} c_j e^{ijx}$ is the interpolating trigonometric polynomial. Thus, the interpolating trigonometric polynomial is closely related to the best approximating trigonometric polynomial in the inner product space.
To show that the $c_j$ satisfy (4.40), we need the following lemma.


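LEMMA 4.4
$$\frac{1}{N}\sum_{k=0}^{N-1} e^{-ijx_k} e^{i\ell x_k} = \begin{cases} 1 & \text{if } \ell = j, \\ 0 & \text{if } \ell \ne j, \end{cases}$$
where $x_k = 2\pi k/N$ and $0 \le \ell, j \le N-1$.


PROOF
$$\frac{1}{N}\sum_{k=0}^{N-1} e^{-ijx_k} e^{i\ell x_k} = \frac{1}{N}\sum_{k=0}^{N-1} e^{-i(j - \ell)2\pi k/N} = \frac{1}{N}\sum_{k=0}^{N-1} A^k,$$
where $A = e^{-i(j - \ell)2\pi/N}$. If $\ell = j$, then $A = 1$ and thus $\frac{1}{N}\sum_{k=0}^{N-1} A^k = 1$. If $\ell \ne j$, then $A \ne 1$ and
$$\frac{1}{N}\sum_{k=0}^{N-1} A^k = \frac{1}{N}\,\frac{1 - A^N}{1 - A} = 0,$$
where the last equation follows from the fact that $e^{-2\pi i m} = 1$ for any integer $m$.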
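We now have:


THEOREM 4.18
The exponential polynomial
$$p(x) = \sum_{j=0}^{N-1} c_j e^{ijx} \quad \text{with} \quad c_j = \frac{1}{N}\sum_{k=0}^{N-1} f(x_k) e^{-ijx_k}$$
interpolates the points $\{(x_i, f(x_i))\}_{i=0}^{N-1}$.


PROOF
$$\begin{aligned}
p(x_\nu) = \sum_{j=0}^{N-1} c_j e^{ijx_\nu}
&= \sum_{j=0}^{N-1} \frac{1}{N}\sum_{k=0}^{N-1} f(x_k) e^{-ijx_k} e^{ijx_\nu} \\
&= \sum_{k=0}^{N-1} f(x_k) \frac{1}{N}\sum_{j=0}^{N-1} e^{-ijx_k} e^{ijx_\nu}
= \sum_{k=0}^{N-1} f(x_k) \frac{1}{N}\sum_{j=0}^{N-1} e^{ij(\nu - k)2\pi/N} \\
&= \sum_{k=0}^{N-1} f(x_k) \frac{1}{N}\sum_{j=0}^{N-1} e^{-ikx_j} e^{i\nu x_j}
= \sum_{k=0}^{N-1} f(x_k)\,\delta_{k\nu} \quad \text{(by Lemma 4.4)} \\
&= f(x_\nu) \quad \text{for } \nu = 0, 1, 2, \ldots, N-1.
\end{aligned}$$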
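Theorem 4.18 is easy to confirm numerically. The sketch below computes the coefficients (4.40) directly, at $O(N^2)$ cost, and evaluates $p$ at the nodes; the function names and the sample function $e^{\sin x}$ are choices made for this illustration only.

```python
import cmath
import math

def dft_coefficients(values):
    """c_j = (1/N) sum_k f(x_k) exp(-i j x_k), with x_k = 2 pi k / N."""
    n = len(values)
    return [sum(values[k] * cmath.exp(-2j * math.pi * j * k / n)
                for k in range(n)) / n
            for j in range(n)]

def trig_poly(coeffs, x):
    """Evaluate p(x) = sum_j c_j exp(i j x)."""
    return sum(c * cmath.exp(1j * j * x) for j, c in enumerate(coeffs))

N = 8
nodes = [2.0 * math.pi * k / N for k in range(N)]
samples = [math.exp(math.sin(x)) for x in nodes]    # a 2 pi-periodic test function
c = dft_coefficients(samples)
residual = max(abs(trig_poly(c, x) - v) for x, v in zip(nodes, samples))
print("largest interpolation residual at the nodes:", residual)
```

The residual is at the level of rounding error, as the theorem predicts. Between the nodes, $p(x)$ is complex valued in general; Remark 4.43 addresses the real form for real data.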


REMARK 4.43 Using Euler's identity, for real data $f(x_j) \in \mathbb{R}$ for $j = 0, 1, \ldots, N-1$, $p(x) = \sum_{j=0}^{N-1} c_j e^{ijx}$ can be written in terms of a series of real functions $\cos jx$ and $\sin jx$ (Exercise 29 below).

REMARK 4.44 To obtain
$$c_j = \frac{1}{N}\sum_{k=0}^{N-1} f(x_k) e^{-ijx_k}, \quad j = 0, 1, 2, \ldots, N-1,$$
directly requires $O(N^2)$ operations, which is prohibitive for $N$ large, say $N = 16384$. Fortunately, in 1965, Cooley and Tukey developed the Fast Fourier Transform (FFT) [19] for this purpose. This algorithm requires only $O(N \log_2 N)$ operations. For example, for $N = 16384$, $N^2 = 2.684 \times 10^8$ and $N \log_2 N = 2.2938 \times 10^5$. The operation reduction in the Fast Fourier Transform results from calculating the $c_j$ in clusters. The FFT has enabled such modern technologies as compact disk audio, DVD video, storage and transmittal of sounds and video over the internet, and digital television signals.
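4.5.3 The FFT Procedure


Suppose $N = 2^r$ for some positive integer $r$. (This is not necessary, but the method is most efficient if $N = 2^r$, and some software requires this condition.) Let $M = N/2 = 2^{r-1}$. Then, for $j = 0, 1, 2, \ldots, M-1$,
$$c_j = \frac{1}{2M}\sum_{k=0}^{2M-1} f(x_k) e^{-ijx_k} \quad \text{and} \quad c_{j+M} = \frac{1}{2M}\sum_{k=0}^{2M-1} f(x_k) e^{-ijx_k} e^{-iMx_k},$$
where $x_k = (2\pi k)/(2M) = (\pi k)/M$. Thus,
$$c_j + c_{j+M} = \frac{1}{2M}\sum_{k=0}^{2M-1} f(x_k) e^{-ijx_k}\big(1 + e^{-i\pi k}\big),$$
but
$$1 + e^{-i\pi k} = \begin{cases} 2 & \text{if } k \text{ is even}, \\ 0 & \text{if } k \text{ is odd}. \end{cases}$$
Hence,
$$c_j + c_{j+M} = \frac{1}{M}\sum_{\substack{k=0 \\ k \text{ even}}}^{2M-1} f(x_k) e^{-ijx_k} = \frac{1}{M}\sum_{k=0}^{M-1} f(x_{2k}) e^{-ij2\pi k/M} \tag{4.41}$$
for $j = 0, 1, \ldots, M-1$. Similarly,
$$c_j - c_{j+M} = \frac{1}{M}\sum_{k=0}^{M-1} f(x_{2k+1}) e^{-ij(2k+1)\pi/M} \quad \text{for } j = 0, 1, 2, \ldots, M-1. \tag{4.42}$$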


The coefficients $c_j$ and $c_{j+M}$ for $j = 0, 1, \ldots, M-1$ can be recovered from (4.41) and (4.42) with $O(M^2)$ operations rather than the $O((2M)^2)$ required in direct calculation of the $c_j$'s. Furthermore, letting, for $j = 0, 1, 2, \ldots, M-1$,
$$\begin{aligned}
c_j^{(1)} &= \frac{1}{M}\sum_{k=0}^{M-1} f(x_{2k}) e^{-ijk\pi/(M/2)}, \\
c_j^{(2)} &= \frac{1}{M}\sum_{k=0}^{M-1} f(x_{2k+1}) e^{-ijk\pi/(M/2)},
\end{aligned} \tag{4.43}$$
we see that $c_j^{(1)}$ and $c_j^{(2)}$ are of the same form as
$$c_j = \frac{1}{2M}\sum_{k=0}^{2M-1} f(x_k) e^{-ijk\pi/M}.$$
In terms of these quantities, (4.41) and (4.42) read $c_j + c_{j+M} = c_j^{(1)}$ and $c_j - c_{j+M} = e^{-ij\pi/M} c_j^{(2)}$. Hence, the same procedure can be repeated on $c_j^{(1)}$ and $c_j^{(2)}$. If this process is repeated (n + 1) times, it can be shown that the total number of operations is proportional to $N \log_2 N$. Schematically,


Level 1: one set of $N$ constants; $O(N^2)$ operations if computed directly.
Level 2: two sets of $N/2$ constants.
Level 3: four sets of $N/4$ constants.
...
Final level: $2^r$ sets ($2^r$ rows) of $N/2^r$ constants; $2^r\,O\big((N/2^r)^2\big)$ operations.


REMARK 4.45 At each level, $N$ constants are present. To go backward from level $(i+1)$ to level $i$, i.e., to find the constants at each level, requires $O(N)$ operations per level. Thus, since $r$ levels are present, the total number of operations is proportional to $rN$ or $N \log_2 N$.
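The splitting in (4.41) and (4.42) translates directly into a recursive procedure. The following is a minimal sketch, assuming $N$ is a power of two and keeping the $1/N$ normalization of (4.40); it is meant to illustrate the idea, not to reproduce the original Cooley and Tukey implementation. The names `fft_coefficients` and `direct_coefficients` are hypothetical.

```python
import cmath
import math

def direct_coefficients(values):
    """O(N^2) evaluation of c_j = (1/N) sum_k f(x_k) exp(-i j x_k)."""
    n = len(values)
    return [sum(values[k] * cmath.exp(-2j * math.pi * j * k / n)
                for k in range(n)) / n
            for j in range(n)]

def fft_coefficients(values):
    """Recursive radix-2 evaluation of the same coefficients.
    Uses c_j + c_{j+M} = E_j and c_j - c_{j+M} = exp(-i j pi / M) * O_j,
    where E and O are the coefficients of the even- and odd-indexed samples."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    m = n // 2
    even = fft_coefficients(values[0::2])
    odd = fft_coefficients(values[1::2])
    out = [0j] * n
    for j in range(m):
        w = cmath.exp(-1j * math.pi * j / m) * odd[j]
        out[j] = (even[j] + w) / 2
        out[j + m] = (even[j] - w) / 2
    return out

N = 16
samples = [math.exp(math.sin(2.0 * math.pi * k / N)) for k in range(N)]
fast = fft_coefficients(samples)
slow = direct_coefficients(samples)
difference = max(abs(a - b) for a, b in zip(fast, slow))
print("largest difference between fast and direct coefficients:", difference)
```

Each level of the recursion does $O(N)$ work to combine the half-size results, and there are $\log_2 N$ levels, which matches the operation count in Remark 4.45.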


REMARK 4.46 The branch of applied analysis dealing with Fourier series is called harmonic analysis. The term is connected to the analysis of music and audio signals. For example, for a violin being played, a Fourier analysis of the function representing the air pressure produced by a vibrating string contains a primary frequency, then harmonics, representing twice the frequency, three times the frequency, etc. The function representing the vibration can be represented directly as a function $f(t)$, where $t$ is time, or as the coefficients of its Fourier series $\{c_j\}_{j=-\infty}^{\infty}$. The time representation is called a representation in the time domain, while the representation in terms of the Fourier series is called the representation in the frequency domain. Replacing $\{c_j\}_{j=-\infty}^{\infty}$ by $\{c_j\}_{j=-n}^{n}$, or, more generally, dropping a selected set of coefficients $c_j$, is an example of filtering.


So far, we have studied approximation by polynomials, by piecewise polyno- 
mials, and by trigonometric functions. Polynomials and piecewise polynomials 
are appropriate for approximating bounded functions on closed intervals, while 
trigonometric functions are especially appropriate for approximating periodic 
functions. We next study rational functions, appropriate for approximating 
bounded functions on infinite intervals and functions with poles at points $a$ (such as $f(x) = e^x/(x - a)$), as well as for more efficient approximations of continuous functions on closed intervals.




4.6 Rational Approximation 


We now study approximation of functions by quotients of polynomials. 
4.6.1 Introduction 


DEFINITION 4.20 Let $R(L, M)$ denote the set of rational functions $r = r(x)$ of the form $r(x) = p(x)/q(x)$, where $p \in P_L$ and $q \in P_M$, i.e., $p$ and $q$ are polynomials of degree at most $L$ and $M$, respectively.


We assume that $r(x)$ is irreducible, that is, that $p$ and $q$ have no common
nonconstant factors. We consider in this section best rational approximants, 
Padé approximants, and interpolating rational functions. Good references for 
rational approximation include [17] and [75]. 


4.6.2 Best Rational Approximations in the Uniform Norm 


THEOREM 4.19
If $f \in C[a,b]$, there exists an $r^* \in R(L, M)$ such that $\|f - r^*\|_\infty \le \|f - r\|_\infty$ for all $r \in R(L, M)$, where $\|\cdot\|_\infty$ is the uniform norm on $C[a,b]$.


PROOF See [17]. (Note that $R(L, M)$ is not a finite-dimensional subspace of $C[a,b]$, so we cannot use the theory that we developed earlier.)


REMARK 4.47 An iterative procedure is used to compute r*, since r* 
cannot generally be found in a finite number of steps. Cheney has provided 
an early survey of such methods [16]. 


4.6.3 Padé Approximation 


Padé approximants are the rational function analogs of Taylor polynomial 
approximations. Padé approximants are useful in many areas, such as in 
numerical methods for solving ordinary differential equations. Rational func- 
tions are particularly useful when approximating functions with poles or when 
finding limiting values of functions as $x \to \pm\infty$.

Assume that $f \in C^{L+M+1}[-b, b]$, and suppose that $p(x)$ and $q(x)$ are polynomials of degree at most $L$ and $M$, respectively, i.e., $p \in P_L$ and $q \in P_M$. Let
$$p(x) = \sum_{i=0}^{L} p_i x^i \quad \text{and} \quad q(x) = \sum_{j=0}^{M} q_j x^j.$$

DEFINITION 4.21 The $L, M$ Padé approximant to $f(x)$ is
$$r(x) = \frac{p(x)}{q(x)}, \quad \text{such that} \quad f^{(k)}(0) = r^{(k)}(0) \ \text{ for } k = 0, 1, \ldots, L+M, \ \text{ and } \ q(0) = q_0 = 1.$$
(If $r(x)$ is irreducible and $q(0) = q_0 \ne 0$, we can divide the numerator and denominator by $q_0$ so that $q(0) = 1$.)


REMARK 4.48 If $M = 0$, then $r(x)$ is the Taylor polynomial approximation to $f(x)$ of degree at most $L$.


REMARK 4.49 The $L, M$ Padé approximant as defined above may not exist.² Consider, for example, $f(x) = 1 + x^2$, $M = 1$, and $L = 1$. Then $f(0) = 1$, $f'(0) = 0$, $f''(0) = 2$, and $r(x) = (p_0 + p_1 x)/(1 + q_1 x)$. The condition $r(0) = f(0) = 1$ gives $p_0 = 1$, and $r'(0) = f'(0) = 0$ gives $p_1 = q_1$. Hence, $r(x) = (1 + p_1 x)/(1 + p_1 x) = 1$, but $r''(0) = 0 \ne f''(0)$. Thus, in this example, the 1,1 Padé approximation does not exist.


PROPOSITION 4.8
Suppose that $f \in C^{L+M+1}[-b, b]$. The $L, M$ Padé approximant $r(x)$ exists if and only if the coefficients $p_i$, $i = 0, 1, \ldots, L$, and $q_j$, $j = 1, 2, \ldots, M$, satisfy the equations
$$\sum_{\ell=0}^{i} \frac{f^{(\ell)}(0)}{\ell!}\, q_{i-\ell} = p_i, \quad \text{for } i = 0, 1, 2, \ldots, M+L, \tag{4.44}$$
where $p_i = 0$ for $i > L$, $q_j = 0$ for $j > M$, and $q_0 = 1$.
PROOF Suppose first that $r(x)$ exists. We can rule out denominators $q(x)$ that vanish at $x = 0$, since otherwise $r(x)$ is either reducible or is infinite at $x = 0$. We thus have $q_0 = 1$. Now consider
$$g(x) = \big(f(x) - r(x)\big) q(x) = f(x) q(x) - p(x).$$
Since $f^{(k)}(0) - r^{(k)}(0) = 0$ for $k = 0, 1, 2, \ldots, M+L$, it is straightforward to show that $g^{(k)}(0) = 0$ for $k = 0, 1, 2, \ldots, M+L$. This implies that $g(x) = x^{M+L+1} Q(x)$, where $Q(x)$ is a continuous function of $x$, i.e., $g$ has a zero of multiplicity $M+L+1$ at $x = 0$. (For example, suppose that $g \in C^1[-b, b]$ and $g(0) = 0$. Let $g'(x) = v(x)$. Then $g(x) = \int_0^x v(s)\,ds$, so $g(x) = xQ(x)$, where $Q(x) = \frac{1}{x}\int_0^x v(s)\,ds$ is a continuous function of $x$.)

²Other definitions for which the $L, M$ Padé approximant does exist are sometimes used.

Now,
$$g(x) = f(x) q(x) - p(x) = \sum_{\ell=0}^{M+L} a_\ell x^\ell\, q(x) - \sum_{i=0}^{L} p_i x^i + R(x) q(x),$$
where
$$a_\ell = \frac{f^{(\ell)}(0)}{\ell!}$$
and
$$R(x) = \frac{1}{(M+L)!}\int_0^x f^{(M+L+1)}(t)(x - t)^{M+L}\,dt = \frac{f^{(M+L+1)}(\xi)\, x^{M+L+1}}{(M+L+1)!}$$
for some $\xi$ between 0 and $x$. Thus, for $g(x) = x^{M+L+1} Q(x)$ to be true, we must have
$$\sum_{\ell=0}^{M+L}\sum_{j=0}^{M} a_\ell q_j x^{\ell + j} - \sum_{i=0}^{L} p_i x^i = 0 \quad \text{for powers of } x \text{ up to } L+M. \tag{4.45}$$
But
$$\begin{aligned}
\sum_{\ell=0}^{M+L}\sum_{j=0}^{M} a_\ell q_j x^{\ell + j}
&= \sum_{j=0}^{M}\sum_{\ell=0}^{M+L} a_\ell q_j x^{\ell + j}
= \sum_{j=0}^{M}\sum_{i=j}^{M+L+j} a_{i-j} q_j x^i \quad (\text{letting } i = \ell + j) \\
&= \sum_{j=0}^{M+L}\sum_{i=j}^{M+L} a_{i-j} q_j x^i \quad \text{for powers of } x \text{ up to } L+M \ (\text{since } q_j = 0 \text{ for } j > M) \\
&= \sum_{i=0}^{M+L}\sum_{j=0}^{i} a_{i-j} q_j x^i \quad \text{switching sums (see the figure below).}
\end{aligned}$$

[Figure: region of summation for the double sum; axis label $j$.]

Thus, by (4.45),
$$p_i = \sum_{j=0}^{i} a_{i-j} q_j \quad \text{for } i = 0, 1, \ldots, M+L$$
(by equating like powers of $x$). Thus, by the change of index $i - j = \ell$,
$$p_i = \sum_{\ell=0}^{i} a_\ell q_{i-\ell}, \quad \text{or} \quad p_i = \sum_{\ell=0}^{i} \frac{f^{(\ell)}(0)}{\ell!}\, q_{i-\ell},$$
for $i = 0, 1, \ldots, M+L$.
Conversely, given that (4.44) is satisfied, it is straightforward to show that $r^{(k)}(0) = f^{(k)}(0)$ for $k = 0, 1, \ldots, M+L$.


Example 4.14 
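Find the 1,1 Padé approximant of $f(x) = e^x$.
Solution: $q_0 = 1$, $f^{(\ell)}(0) = 1$, and
$$\sum_{\ell=0}^{i} \frac{1}{\ell!}\, q_{i-\ell} = p_i$$
for $i = 0, 1, 2$. Thus, for $i = 0$, $p_0 = 1$. For $i = 1$, $q_1 + q_0 = p_1$. For $i = 2$, $q_2 + q_1 + \frac{1}{2}q_0 = p_2$, but $q_2 = p_2 = 0$. Solving, $p_0 = 1$, $q_1 = -\frac{1}{2}$, and $p_1 = \frac{1}{2}$. Hence,
$$r(x) = \frac{1 + x/2}{1 - x/2}.$$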
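The equations (4.44) can be solved mechanically. The sketch below, using exact rational arithmetic from the standard library, first solves the rows $i = L+1, \ldots, L+M$ (where $p_i = 0$) for $q_1, \ldots, q_M$ and then reads off $p_0, \ldots, p_L$. The function name `pade` is hypothetical, and raising an error for a singular system is a simplification adopted here, not a statement from the text.

```python
from fractions import Fraction

def pade(a, L, M):
    """Solve (4.44) for the L, M Pade coefficients.
    a: Maclaurin coefficients a_0, ..., a_{L+M} as Fractions.
    Returns (p, q) with q[0] = 1. Raises ValueError if the linear
    system for q is singular (a conservative simplification)."""
    def coef(m):
        return a[m] if m >= 0 else Fraction(0)

    # Rows i = L+1..L+M of (4.44): sum_{j=1}^{M} a_{i-j} q_j = -a_i.
    rows = [[coef(i - j) for j in range(1, M + 1)] + [-coef(i)]
            for i in range(L + 1, L + M + 1)]
    # Gauss-Jordan elimination in exact arithmetic.
    for col in range(M):
        pivot = next((r for r in range(col, M) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular system for the denominator coefficients")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(M):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [vr - factor * vc for vr, vc in zip(rows[r], rows[col])]
    q = [Fraction(1)] + [rows[r][M] for r in range(M)]
    # Rows i = 0..L of (4.44) give the numerator coefficients.
    p = [sum(coef(i - j) * q[j] for j in range(min(i, M) + 1)) for i in range(L + 1)]
    return p, q

exp_coeffs = [Fraction(1), Fraction(1), Fraction(1, 2)]   # e^x: 1, 1, 1/2
p, q = pade(exp_coeffs, 1, 1)
print(p, q)   # numerator 1 + x/2, denominator 1 - x/2
```

For $f(x) = 1 + x^2$ with $L = M = 1$, the single equation for $q_1$ reads $0 \cdot q_1 = -1$, so this sketch raises an error, in line with Remark 4.49.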


Table 4.3 compares this approximation to the second degree Taylor polynomial at several points.


TABLE 4.3: Comparison of the Padé approximant of Example 4.14 to a degree-2 Taylor polynomial

x      | Padé approximant | Taylor polynomial | e^x
-1/2   | 0.600            | 0.625             | 0.6065


We now have the following interesting result:


PROPOSITION 4.9
If $f(x)$ has $L+M+1$ continuous derivatives in $[-b, b]$ and the $L, M$ Padé approximant $p(x)/q(x)$ exists, then the $(L+M)$-th degree Maclaurin polynomial of $f(x)q(x)s(x)$ is equal to $p(x)s(x)$, where $s(x)$ is an arbitrary polynomial of degree $M$.


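PROOF Let $s(x) = \sum_{i=0}^{M} s_i x^i$ and let $v(x)$ be the $(L+M)$-th degree Maclaurin polynomial of $f(x)q(x)s(x)$. Then
$$\begin{aligned}
v(x) &= \sum_{j=0}^{L+M} \big[f(x)q(x)s(x)\big]^{(j)}(0)\,\frac{x^j}{j!} \\
&= \sum_{j=0}^{L+M}\sum_{i=0}^{j} \binom{j}{i} s^{(j-i)}(0)\,\big[f(x)q(x)\big]^{(i)}(0)\,\frac{x^j}{j!} \\
&= \sum_{j=0}^{L+M}\sum_{i=0}^{j} \binom{j}{i} s^{(j-i)}(0) \sum_{\ell=0}^{i} \binom{i}{\ell} f^{(\ell)}(0)\, q^{(i-\ell)}(0)\,\frac{x^j}{j!} \\
&= \sum_{j=0}^{L+M}\sum_{i=0}^{j}\sum_{\ell=0}^{i} \frac{f^{(\ell)}(0)\, q^{(i-\ell)}(0)\, s^{(j-i)}(0)}{(j-i)!\,(i-\ell)!\,\ell!}\, x^j \\
&= \sum_{j=0}^{L+M}\sum_{i=0}^{j}\sum_{\ell=0}^{i} \frac{f^{(\ell)}(0)}{\ell!}\, q_{i-\ell}\, s_{j-i}\, x^j \\
&= \sum_{j=0}^{L+M}\sum_{i=0}^{j} p_i s_{j-i} x^j \quad (\text{from (4.44)}) \\
&= p(x)s(x),
\end{aligned}$$
since $p$ and $s$ are $L$-th and $M$-th degree polynomials, respectively.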


REMARK 4.50 The Leibniz rule was used in the proof of Proposition 4.9. The Leibniz rule is
$$\frac{d^k}{dx^k}\big(f(x)g(x)\big) = \sum_{j=0}^{k} \binom{k}{j} f^{(j)}(x)\, g^{(k-j)}(x),$$
where $f^{(k)}$ denotes the $k$-th derivative of the function $f$.


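REMARK 4.51 By Proposition 4.9, $p(x)s(x)$ is the $(L+M)$-th degree Taylor polynomial about zero of $f(x)q(x)s(x)$. By considering the remainder term in Taylor polynomial approximation, we can obtain an error estimate for Padé approximation. We have
$$f(x)q(x)s(x) - p(x)s(x) = \frac{x^{L+M+1}}{(L+M+1)!}\,\frac{d^{L+M+1}}{dx^{L+M+1}}\big(f(x)q(x)s(x)\big)\Big|_{x = \xi}.$$
Thus
$$f(x) - \frac{p(x)}{q(x)} = \frac{1}{s(x)q(x)}\,\frac{x^{L+M+1}}{(L+M+1)!}\,\frac{d^{L+M+1}}{dx^{L+M+1}}\big(f(x)q(x)s(x)\big)\Big|_{x = \xi},$$
$$\left| f(x) - \frac{p(x)}{q(x)} \right| \le \frac{|x|^{L+M+1}\, M^*}{|s(x)q(x)|\,(L+M+1)!},$$
where
$$M^* = \max \left| \frac{d^{L+M+1}}{dx^{L+M+1}}\big(f(x)q(x)s(x)\big) \right|,$$
and where $s$ is any polynomial of degree at most $M$. For example, $s$ can be the constant $s = 1$.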


Example 4.15 
Consider $f(x) = \ln(1 + x)$ for $x \ge 0$. The $L = 2$, $M = 1$ Padé approximant to $f(x)$ is
$$r(x) = \frac{6x + x^2}{6 + 4x}.$$
Also,
$$\max_{x \ge 0} \left| \frac{d^4}{dx^4}\big(f(x)q(x)\big) \right| = \max_{x \ge 0} \left| \frac{8}{(1 + x)^3} - \frac{12}{(1 + x)^4} \right| = 4.$$
Consider $s(x) = 1$. Then
$$\left| f(x) - \frac{p(x)}{q(x)} \right| \le \frac{4x^4}{(6 + 4x)\,4!} = \frac{x^4}{36 + 24x}$$
for $x \ge 0$. For $x = 0.1$,
$$\left| f(x) - \frac{p(x)}{q(x)} \right| \le 2.6 \times 10^{-6}.$$
In contrast, for the Taylor polynomial approximation of degree 3 at $x = 0.1$,
$$\left| f(x) - \left( x - \frac{x^2}{2} + \frac{x^3}{3} \right) \right| \le \frac{x^4}{4} = 2.5 \times 10^{-5},$$
and the Taylor polynomial of degree 3 requires the same computational work to evaluate but is a factor of 10 less accurate than the 2,1 Padé approximant at $x = 0.1$.
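The two bounds in Example 4.15 can be compared with the actual errors at $x = 0.1$. This is a small check, with variable names chosen for the illustration.

```python
import math

x = 0.1
exact = math.log1p(x)                        # ln(1 + x)
pade_value = (6.0 * x + x * x) / (6.0 + 4.0 * x)
taylor_value = x - x ** 2 / 2.0 + x ** 3 / 3.0

pade_error = abs(exact - pade_value)
taylor_error = abs(exact - taylor_value)
pade_bound = x ** 4 / (36.0 + 24.0 * x)      # bound from Remark 4.51 with s = 1
taylor_bound = x ** 4 / 4.0                  # Taylor remainder bound for x >= 0

print("Pade error:  ", pade_error, "bound:", pade_bound)
print("Taylor error:", taylor_error, "bound:", taylor_bound)
print("ratio of errors:", taylor_error / pade_error)
```

Both actual errors fall below their bounds, and the ratio of the errors is close to the factor of 10 quoted in the example.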


Padé approximations are commonly used for $f(x) = e^x$. (We will see their
use in the study of numerical methods for solving initial-value problems.) 
Table 4.4 gives a few of these Padé approximants. 


TABLE 4.4: Padé approximants to e* 


1+ 2/3 


1—a/2+ 47/12 


REMARK 4.52 To find Padé approximants to $e^{-x}$, replace $x$ by $-x$ in Table 4.4.


REMARK 4.53 _ A precise error formula for f(x) = e” is 


2 P(x) _ 1 oe L(_4)Mgt 
oF rete I, OO Meee 


et 14g 


_ “ab ee Me-« 'Z 
Se ee 


(letting z = 1-—t/x andt=a— 22). 
4.6.4 Rational Interpolation

Recall that $r \in R(L, M)$ if $r(x) = p(x)/q(x)$, where $p \in P_L$ and $q \in P_M$.
Problem: Find $r \in R(L, M)$ such that $r(x_i) = y_i$, $1 \le i \le k$, where $x_i$, $1 \le i \le k$, are distinct points on $[a,b]$.
Difficulty: $r(x)$ may not exist, as the following two examples illustrate.


Example 4.16 
Suppose that $r \in R(0, M)$. Then, $r(x) = 1/q(x)$. Suppose that $y_i = 0$ for some $i$. Then we cannot interpolate this point.


Example 4.17 
Suppose that $r \in R(1,1)$, i.e.,
$$r(x) = \frac{x + a_0}{b_0 + b_1 x}.$$
Suppose that $y_1 = y_2 \ne y_3$. Then $r(x_1) = r(x_2) = y_1 = y_2$, and it follows that
$$\frac{x_1 + a_0}{b_0 + b_1 x_1} = \frac{x_2 + a_0}{b_0 + b_1 x_2},$$
and thus, $b_0(x_1 - x_2) = a_0 b_1 (x_1 - x_2)$, or $b_0 = a_0 b_1$.

Case a: If $b_1 \ne 0$, then
$$r(x) = \frac{x + a_0}{b_1(a_0 + x)} = \frac{1}{b_1}.$$
Thus, $r(x)$ is constant, and $r(x_3) \ne y_3$.

Case b: If $b_1 = 0$, then $b_0 = 0$ and the denominator vanishes, so $r(x)$ does not exist.


Example 4.18 
Suppose that $r \in R(L, 0)$. Then the interpolation problem has a unique solution when $k = L + 1$. Indeed, $r(x)$ is the Lagrange interpolating polynomial.


If $r(x)$ does exist, we have the following.


THEOREM 4.20 
Suppose that f has k continuous derivatives on [a,b] and suppose that a = 
Ly <a. < +++ < ap = bd. Letr = p/q € R(L,M), wherek = L+M +1, 
interpolates f(a;) at vj, 1<i<k. Then for each x € |a,b], there exists a 
€ = €(x) € [a,b] such that 

p(t) _ (@— 21)(u — w2)... (a — wx) a*( Fa) 


fla) — OE = (4) 
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where we assume that q(x) does not vanish on [a,b]. As a consequence, 


k 
max | f(x) — p(x) < max ae a dk f(x)q(2) 
a<a2<b q(x)|~ a<x<b}| kiq(x) | a<a<b dak 
PROOF If
\[ f(x_i) - \frac{p(x_i)}{q(x_i)} = 0 \quad \text{for } 1 \le i \le k, \]
then
\[ f(x_i)q(x_i) - p(x_i) = 0 \quad \text{for } 1 \le i \le k, \tag{4.48} \]

since $q(x_i) \ne 0$ for $i = 1,2,\ldots,k$. Hence, $p(x)$ is the Lagrange interpolating
polynomial of $f(x)q(x)$ for points $x_i$, $1 \le i \le k$. Thus we know for Lagrange
interpolation,

\[ f(x)q(x) - p(x) = \frac{(x - x_1)(x - x_2)\cdots(x - x_k)}{k!} \, \frac{d^k}{dx^k}\bigl(f(x)q(x)\bigr)\Big|_{x=\xi} \]

for some $\xi(x) \in [a,b]$. Dividing by $q(x)$, we obtain the desired result.
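
The failure described in Example 4.17 can be observed numerically. The following is a minimal sketch, not part of the original text: it assumes the normalized form $r(x) = (a_0 + x)/(b_0 + b_1 x)$ used in that example, solves the linearized conditions $a_0 + x_i = y_i(b_0 + b_1 x_i)$ exactly, and then checks the denominator. The helper names `solve_linear` and `linearized_r11` and the sample data are illustrative assumptions.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve a small square system exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, b)]
    for c in range(n):
        # Pick any row with a nonzero pivot; a singular system raises StopIteration.
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

def linearized_r11(xs, ys):
    """Coefficients (a0, b0, b1) from a0 + x_i = y_i*(b0 + b1*x_i), i = 1, 2, 3."""
    A = [[1, -y, -y * x] for x, y in zip(xs, ys)]
    rhs = [-x for x in xs]
    return solve_linear(A, rhs)

xs = [Fraction(0), Fraction(1), Fraction(2)]
ys = [Fraction(1), Fraction(1), Fraction(3)]   # y1 = y2 != y3, as in Example 4.17
a0, b0, b1 = linearized_r11(xs, ys)

def p(x):
    return a0 + x

def q(x):
    return b0 + b1 * x

print(a0, b0, b1)            # b0 equals a0*b1, so p and q share the factor (a0 + x)
print(p(xs[2]), q(xs[2]))    # both vanish at x3, so r(x3) is 0/0 rather than y3
```

The linearized system has a solution, but numerator and denominator vanish together at $x_3$, so no function in $R(1,1)$ of this form takes the value $y_3$ there.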


Next, we consider a type of orthogonal basis with certain mathematical
properties similar to those of the Fourier series basis ($\sin(kx)$ and $\cos(kx)$, or
$e^{ikx}$). Various bases in this class can be designed to do a good job at particular
approximation tasks.


4.7 Wavelet Bases 


In this section, we give a brief description of wavelet bases in $L^2(\mathbb{R})$. Two
good references are [3] and [15]. Wavelets are special sets of orthogonal functions
in $L^2(\mathbb{R})$, where $L^2(\mathbb{R})$ is the set of integrable functions $f$ such that

\[ \int_{-\infty}^{\infty} f^2(x)\,dx < \infty, \]

and where orthogonality is with respect to the inner product

\[ (f,g) = \int_{-\infty}^{\infty} f(x)g(x)\,dx. \]

Wavelets are distinguished by being able to accurately represent high-frequency
signals. (Because wavelets are orthogonal basis functions on the inner product
space $L^2(\mathbb{R})$, the least squares approximation in terms of a wavelet expansion
is easy to obtain.)
4.7.1 Assumed Properties of the Scaling Function


Underlying the development of wavelets is a “scaling function” $\varphi(x) \in
L^2(\mathbb{R})$. We assume that $\varphi$ satisfies:

\[ \varphi(x) = \sum_{\ell=0}^{L} a_\ell\, \varphi(2x - \ell) \tag{4.49} \]

for some real numbers $a_\ell$, where $\varphi(x) = 0$ for $x < 0$ or $x \ge L$, i.e., $\varphi$ is
supported on $[0,L)$. For notational convenience, we also assume that the
expansion in (4.49) is infinite, but $a_\ell = 0$ for $\ell < 0$ or $\ell > L$. In addition,
assume that

\[ \int_{-\infty}^{\infty} \varphi(x - k)\varphi(x - \ell)\,dx = \delta_{k\ell}. \tag{4.50} \]

Example 4.19 

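$L = 1$, $a_0 = 1$, and $a_1 = 1$. Then $\varphi(x) = \varphi(2x) + \varphi(2x - 1)$, and $\varphi(x)$
is supported on $[0,1)$. This implies that $\varphi$ is constant on $(0,1)$. Hence, by
(4.50), $\varphi(x) = 1$ for $x \in [0,1)$. Note that

\[ \varphi(x - k) = \begin{cases} 1 & \text{if } k \le x < k+1, \\ 0 & \text{otherwise.} \end{cases} \]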


PROPOSITION 4.10 

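Equations (4.49) and (4.50) give the condition
\[ \frac{1}{2} \sum_{m=-\infty}^{\infty} a_{2k+m}\, a_{2\ell+m} = \delta_{k\ell}, \qquad -\infty < k, \ell < \infty. \]


PROOF

\begin{align*}
\delta_{k\ell} &= \int_{-\infty}^{\infty} \varphi(x - k)\varphi(x - \ell)\,dx \\
&= \int_{-\infty}^{\infty} \sum_{m=-\infty}^{\infty} a_m \varphi(2(x - k) - m) \sum_{n=-\infty}^{\infty} a_n \varphi(2(x - \ell) - n)\,dx \\
&= \frac{1}{2} \sum_{m}\sum_{n} a_m a_n \int_{-\infty}^{\infty} \varphi(2x - 2k - m)\varphi(2x - 2\ell - n)\,2\,dx \\
&= \frac{1}{2} \sum_{m}\sum_{n} a_m a_n\, \delta_{2k+m,\,2\ell+n} = \frac{1}{2} \sum_{m} a_m\, a_{2(k-\ell)+m} \\
&= \frac{1}{2} \sum_{m=-\infty}^{\infty} a_{m+2\ell}\, a_{m+2k} \qquad \text{(replacing $m$ by $m + 2\ell$).}
\end{align*}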
Example 4.20
Let $\varphi$ be as in Example 4.19. Note that $L = 1$, $a_0 = 1$, $a_1 = 1$ satisfies
Proposition 4.10. Specifically,
\begin{align*}
\frac{1}{2} \sum_{m=-\infty}^{\infty} a_{2k+m}\, a_{2\ell+m} &= \frac{1}{2} a_0\, a_{2\ell-2k} + \frac{1}{2} a_1\, a_{2\ell-2k+1} \\
&= \frac{1}{2} a_{2\ell-2k} + \frac{1}{2} a_{2\ell-2k+1} \\
&= \delta_{k\ell}.
\end{align*}


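Now let $V_0 \subset L^2(\mathbb{R})$ be the set of linear combinations of integer translates of $\varphi$, i.e.,

\[ V_0 = \Bigl\{ f \in L^2(\mathbb{R}) : f(x) = \sum_{k=-\infty}^{\infty} \alpha_k \varphi(x - k) \Bigr\}. \tag{4.51} \]

Then $V_0$ is a closed subspace of $L^2(\mathbb{R})$. In addition,
\[ \alpha_k = \int_{-\infty}^{\infty} f(x)\varphi(x - k)\,dx \]
by the orthogonality property (4.50).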


Example 4.21 


If $\varphi$ is as in Example 4.20, then $f \in V_0$ has the form shown in Figure 4.11.


FIGURE 4.11: The graph of a piecewise constant function $f$, as in Example 4.21.


We now define $V_n$ as dilates of $V_0$, i.e.,

\[ V_n = \Bigl\{ f \in L^2(\mathbb{R}) : f(x) = \sum_{k=-\infty}^{\infty} \alpha_k \varphi(2^{-n}x - k) \Bigr\}. \tag{4.52} \]


REMARK 4.54 Note that $\varphi(2^{-n}x)$ has support $[0, 2^n L)$ and $\varphi(2^{-n}x - k)$
has support $[2^n k, 2^n k + 2^n L)$. Also, if $f \in V_n$ then

\[ \alpha_k = \frac{\displaystyle\int_{-\infty}^{\infty} f(x)\varphi(2^{-n}x - k)\,dx}{\displaystyle\int_{-\infty}^{\infty} \varphi^2(2^{-n}x - k)\,dx}. \]

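REMARK 4.55 Notice that if $f \in V_n$ then $f \in V_{n-1}$. We thus have the
containment hierarchy:

\[ \cdots \subset V_2 \subset V_1 \subset V_0 \subset V_{-1} \subset V_{-2} \subset \cdots \subset L^2(\mathbb{R}). \tag{4.53} \]

To see this more clearly, suppose for example that $f \in V_0$. Then

\[ f(x) = \sum_{k=-\infty}^{\infty} \alpha_k \varphi(x - k) = \sum_{k=-\infty}^{\infty} \alpha_k \sum_{\ell=0}^{L} a_\ell\, \varphi(2x - 2k - \ell). \]
Hence,
\begin{align*}
f(x) &= \sum_{\ell=0}^{L} a_\ell \sum_{k=-\infty}^{\infty} \alpha_k\, \varphi(2x - 2k - \ell) \\
&= \sum_{\ell=0}^{L} a_\ell \sum_{\hat{k}} \alpha_{(\hat{k}-\ell)/2}\, \varphi(2x - \hat{k}) \qquad \text{(substituting $\hat{k} = 2k + \ell$)} \\
&= \sum_{\hat{k}=-\infty}^{\infty} \beta_{\hat{k}}\, \varphi(2x - \hat{k}) \in V_{-1},
\end{align*}
where $\beta_{\hat{k}}$ collects the terms $a_\ell\, \alpha_{(\hat{k}-\ell)/2}$ for which $(\hat{k} - \ell)/2$ is an integer.
Thus, if $f \in V_0$ then $f \in V_{-1}$.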


Example 4.22 
For Example 4.21, $f \in V_{-1}$ has the form shown in the following figure.

[Figure: graph of a piecewise constant function $f \in V_{-1}$.]


We now make one final assumption concerning the scaling function $\varphi(x)$:
Assume that

\[ x^j \in V_0 \quad \text{for } j = 0,1,\ldots,N-1. \tag{4.54} \]
That is,
\[ x^j = \sum_{k=-\infty}^{\infty} \alpha_{j,k}\, \varphi(x - k) \]
for some $\{\alpha_{j,k}\}_{k=-\infty}^{\infty}$.

REMARK 4.56 $N$ and $L$ are related to each other.


Example 4.23
For Example 4.22, with $L = 1$, $a_0 = 1$, $a_1 = 1$, we see that $N = 1$. That
is, $1 \in V_0$, but $x \notin V_0$.


The following convergence result can now be shown for least-squares approximations
in $V_j$.


THEOREM 4.21
Suppose that (4.49), (4.50), and (4.54) are satisfied for positive integers $L$ and $N$.
Let $f \in C^N(\mathbb{R}) \cap L^2(\mathbb{R})$ and let $f_j \in V_j$ be the least-squares approximation to
$f$ in $V_j$, i.e.,

\[ f_j(x) = \sum_{k=-\infty}^{\infty} \alpha_{j,k}\, \varphi(2^{-j}x - k) \quad \text{with} \quad \alpha_{j,k} = 2^{-j}\bigl(f, \varphi(2^{-j}x - k)\bigr). \]

Then the sequence $f_1, f_0, f_{-1}, \ldots$ converges to $f$ with order $N$.
PROOF See [87].


REMARK 4.57 Theorem 4.21 implies that the union of the $V_n$ is dense
in $L^2(\mathbb{R})$, i.e., $\overline{\bigcup_n V_n} = L^2(\mathbb{R})$.


Example 4.24
For Example 4.21, $L = 1$ and $N = 1$. Thus, the sequence $f_2, f_1, f_0, f_{-1}, \ldots$
converges to $f \in C^1(\mathbb{R}) \cap L^2(\mathbb{R})$ with order 1.


4.7.2 The Wavelet Basis and the Scaling Function 


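We now develop an orthogonal basis in $L^2(\mathbb{R})$. Let $W_n$ be the orthogonal
complement of $V_n$ in $V_{n-1}$. That is,

\[ W_n \oplus V_n = V_{n-1} \quad \text{and} \quad W_n \perp V_n. \tag{4.55} \]

Thus, if $f \in V_{n-1}$ then $f = f_1 + f_2$ where $f_1 \in W_n$, $f_2 \in V_n$, and $(f_1, f_2) = 0$.


REMARK 4.58 Suppose that $n < m$. Then $W_m \oplus V_m = V_{m-1}$, $W_m \perp V_m$,
$W_n \oplus V_n = V_{n-1}$, and $W_n \perp V_n$. But $W_m \subset V_{m-1} \subset V_n$, so $W_m \perp W_n$.
Thus, if $n \ne m$, $W_n \perp W_m$. This implies, along with Remark 4.57, that the
direct sum of the $W_n$ is $L^2(\mathbb{R})$. Hence,

\[ \bigoplus_{n=-\infty}^{\infty} W_n = \overline{\bigcup_{n} V_n} = L^2(\mathbb{R}) \quad \text{and} \quad W_n \perp W_m \text{ for } n \ne m. \tag{4.56} \]


REMARK 4.59 Recall that $V_0$ is spanned by integer translates of $\varphi(x)$.
$V_{-1}$ is spanned by integer translates of $\varphi(2x)$ and $\varphi(2x - 1)$. Since $V_0 \oplus W_0 =
V_{-1}$, this implies that $W_0$ is spanned by integer translates of some function
$w(x)$. Thus,

\[ W_0 = \Bigl\{ f \in L^2(\mathbb{R}) : f(x) = \sum_{k=-\infty}^{\infty} \alpha_k\, w(x - k) \Bigr\}. \]

Similarly, the spaces $W_n$, $-\infty < n < \infty$, are dilates of $W_0$. Thus, from (4.56),

\[ L^2(\mathbb{R}) = \Bigl\{ f : f(x) = \sum_{n=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \alpha_{n,k}\, w(2^{-n}x - k) \Bigr\}. \]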
DEFINITION 4.22 The function $w(x)$ is called a wavelet. (Dilates and
translates of $w(x)$ are orthogonal and form a basis for $L^2(\mathbb{R})$.)

We now wish to determine $w(x)$ from the scaling function $\varphi(x)$. Note that
$w \in W_0 \subset V_{-1}$, so

\[ w(x) = \sum_{k=-\infty}^{\infty} b_k\, \varphi(2x - k). \tag{4.57} \]

However, recall that $\varphi \in V_0 \subset V_{-1}$, so

\[ \varphi(x) = \sum_{k=-\infty}^{\infty} a_k\, \varphi(2x - k) = \sum_{k=0}^{L} a_k\, \varphi(2x - k). \tag{4.58} \]

The following proposition shows that $b_k = (-1)^k a_{1-k}$, and thus
$w(x) = \sum_{k} (-1)^k a_{1-k}\, \varphi(2x - k)$.


PROPOSITION 4.11

If $b_k = (-1)^k a_{1-k}$, then

\[ \int_{-\infty}^{\infty} \varphi(x - k)\, w(x - \ell)\,dx = 0 \quad \text{and} \quad \int_{-\infty}^{\infty} w(x - k)\, w(x - \ell)\,dx = \delta_{k\ell}. \]

(This guarantees that $W_0 \perp V_0$.)
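PROOF

\begin{align*}
\int_{-\infty}^{\infty} \varphi(x - k)\, w(x - \ell)\,dx &= \frac{1}{2} \int_{-\infty}^{\infty} \Bigl[ \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} a_m b_n\, \varphi(2x - 2k - m)\varphi(2x - 2\ell - n) \Bigr] 2\,dx \\
&= \frac{1}{2} \sum_{m=-\infty}^{\infty} a_{m+2\ell}\, b_{m+2k} \qquad \text{(see the proof of Proposition 4.10)} \\
&= \frac{1}{2} \sum_{m=-\infty}^{\infty} a_{m+2\ell}\, (-1)^{2k+m} a_{1-2k-m} \\
&= 0
\end{align*}

(since each product $a_i a_j$ occurs twice with opposite signs). Similarly,

\begin{align*}
\int_{-\infty}^{\infty} w(x - k)\, w(x - \ell)\,dx &= \int_{-\infty}^{\infty} \Bigl[ \sum_{m=-\infty}^{\infty} b_m\, \varphi(2x - 2k - m) \Bigr] \Bigl[ \sum_{n=-\infty}^{\infty} b_n\, \varphi(2x - 2\ell - n) \Bigr] dx \\
&= \frac{1}{2} \sum_{m=-\infty}^{\infty} b_{2k+m}\, b_{2\ell+m} \\
&= \frac{1}{2} \sum_{m=-\infty}^{\infty} (-1)^{2k+m} a_{1-2k-m}\, (-1)^{2\ell+m} a_{1-2\ell-m} \\
&= \frac{1}{2} \sum_{m=-\infty}^{\infty} a_{1-2k-m}\, a_{1-2\ell-m} \\
&= \delta_{k\ell}, \quad \text{by Proposition 4.10.}
\end{align*}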
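
The discrete conditions appearing in Propositions 4.10 and 4.11 are easy to test for a given coefficient list. The sketch below is an illustration only and is not taken from the text: the function names are hypothetical, the coefficients are stored as a list `a[0..L]` with zeros assumed outside that range, and the sums are checked for a small window of shifts $k, \ell$.

```python
def coeff(a, i):
    """Return a_i, treating coefficients outside 0..L as zero."""
    return a[i] if 0 <= i < len(a) else 0.0

def wavelet_coeffs(a):
    """b_k = (-1)^k a_{1-k}; nonzero only for 1-L <= k <= 1."""
    L = len(a) - 1
    return {k: (-1) ** k * coeff(a, 1 - k) for k in range(1 - L, 2)}

def half_sum(u, v, k, l, span):
    """(1/2) * sum over m of u(m + 2l) * v(m + 2k), on a window covering all nonzero terms."""
    return 0.5 * sum(u(m + 2 * l) * v(m + 2 * k) for m in range(-span, span + 1))

def check(a, shifts=range(-2, 3), tol=1e-12):
    """Return (phi-phi orthonormal, phi-w orthogonal, w-w orthonormal) as booleans."""
    b = wavelet_coeffs(a)
    A = lambda i: coeff(a, i)
    B = lambda i: b.get(i, 0.0)
    span = 4 * len(a) + 2 * max(abs(s) for s in shifts)
    ok_phi = ok_mixed = ok_w = True
    for k in shifts:
        for l in shifts:
            d = 1.0 if k == l else 0.0
            ok_phi &= abs(half_sum(A, A, k, l, span) - d) < tol
            ok_mixed &= abs(half_sum(A, B, k, l, span)) < tol
            ok_w &= abs(half_sum(B, B, k, l, span) - d) < tol
    return ok_phi, ok_mixed, ok_w

print(check([1.0, 1.0]))        # coefficients of Example 4.19
print(check([0.5, 1.0, 0.5]))   # refinement coefficients of the hat function
```

The second call uses the hat-function coefficients as a counterexample: the mixed sum still vanishes, because the sign pattern in $b_k$ cancels terms in pairs regardless of the $a_k$, but the two orthonormality conditions fail.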


Example 4.25 
Continuing Example 4.21, $L = 1$, $a_0 = 1$, $a_1 = 1$, so $b_1 = -1$ and $b_0 = 1$.
Thus, $w(x) = \varphi(2x) - \varphi(2x - 1)$; see Figure 4.12. That is,

\[ w(x) = \begin{cases} 1, & 0 \le x < 1/2, \\ -1, & 1/2 \le x < 1, \\ 0, & \text{otherwise.} \end{cases} \]

The basis constructed using this $w(x)$ is called the Haar basis.


FIGURE 4.12: $w(x) = \varphi(2x) - \varphi(2x - 1)$ in Example 4.25.


The Haar basis is easily constructed and is the simplest wavelet basis. This
basis has $L = 1$ and $N = 1$, and Theorem 4.21 shows that the approximations
converge in this basis, but not particularly fast. To improve the convergence
rate, an $N \ge 2$ would be required. However, in this case the scaling function
$\varphi(x)$ is complicated. To determine $\varphi(x)$, we need the following additional
relation.


PROPOSITION 4.12 
\[ \sum_{k=-\infty}^{\infty} (-1)^k a_k k^j = 0 \quad \text{for } j = 0,1,\ldots,N-1, \]
where the $a_k$ are as in (4.49), subject to the additional assumption (4.54).


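PROOF By (4.54), $x^j \in V_0$ for $j = 0,1,\ldots,N-1$. Since $w \in W_0$, $w \perp x^j$
for $j = 0,1,\ldots,N-1$. Thus, writing $x = \frac{1}{2}[(2x - k) + k]$ and expanding by the binomial theorem,

\begin{align*}
0 = \int_{-\infty}^{\infty} x^j w(x)\,dx &= \int_{-\infty}^{\infty} x^j \sum_{k=-\infty}^{\infty} b_k\, \varphi(2x - k)\,dx \\
&= \frac{1}{2^{j+1}} \sum_{k=-\infty}^{\infty} b_k \sum_{r=0}^{j} \binom{j}{r} k^{j-r} \int_{-\infty}^{\infty} (2x - k)^r \varphi(2x - k)\,2\,dx \\
&= \frac{1}{2^{j+1}} \sum_{r=0}^{j} \binom{j}{r} \Bigl( \sum_{k=-\infty}^{\infty} b_k k^{j-r} \Bigr) \int_{-\infty}^{\infty} x^r \varphi(x)\,dx.
\end{align*}
But
\[ \int_{-\infty}^{\infty} x^r \varphi(x)\,dx \ne 0, \tag{4.59} \]

because of (4.50) and since $x^r$ can be written as a linear combination of
translates of $\varphi$ for $r = 0,1,\ldots,N-1$. (Recall that $x^r \in V_0$; you will fill in
the details in Exercise 39.) Thus, letting $j = 0$,

\[ \sum_{k=-\infty}^{\infty} b_k = 0. \]

Letting $j = 1$,

\[ \sum_{k=-\infty}^{\infty} b_k k = 0. \]

Examining the preceding computations in this proof, we can thus conclude by
induction that
\[ \sum_{k} b_k k^j = 0 \]
for $j = 0,1,\ldots,N-1$. But $b_k = (-1)^k a_{1-k}$ by Proposition 4.11. Substituting this
and replacing $k$ by $1 - k$ gives $\sum_k (-1)^k a_k (1 - k)^j = 0$ for $j = 0,1,\ldots,N-1$,
which is equivalent to the assertion.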


4.7.3 Construction of Scaling Functions 


We now consider construction of
\[ \varphi(x) = \sum_{\ell=0}^{L} a_\ell\, \varphi(2x - \ell). \]

First, observe that $a_\ell \ne 0$ only for $\ell = 0,1,\ldots,L$. Thus, we have $L + 1$
unknowns. Keeping this in mind, Propositions 4.10 and 4.12 give

(A) $\displaystyle\sum_{k=-\infty}^{\infty} (-1)^k a_k k^j = 0$ for $j = 0,1,\ldots,N-1$, $N$ conditions;

(B) $\displaystyle\frac{1}{2} \sum_{m=-\infty}^{\infty} a_{2k+m}\, a_{2\ell+m} = \delta_{k\ell}$ for $-\infty < k, \ell < \infty$, $(L+1)/2$ conditions,
considering the nonzero $a$'s and redundantly listed conditions.


Thus, we have a total of $N + (L+1)/2$ conditions. Equating the number of
variables and number of conditions shows that, given $N$, we need $L = 2N - 1$.


Example 4.26 
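$L = N = 1$.

(A) $a_0 - a_1 = 0$.

(B) $\frac{1}{2}a_0^2 + \frac{1}{2}a_1^2 = 1$.

Thus, $a_0 = 1$ and $a_1 = 1$.
Hence, $\varphi(x) = \varphi(2x) + \varphi(2x - 1)$, with $\varphi(x)$ supported on $[0,1]$. (Also,
$w(x) = \varphi(2x) - \varphi(2x - 1)$.) But $\varphi(x) = \varphi(2x) + \varphi(2x - 1)$ implies that $\varphi$
is constant. Thus, $\varphi(x) = 1$ for $x \in [0,1)$. Hence, $V_0$ consists merely of
linear combinations of translates of $\varphi(x) = 1$ on $x \in [0,1)$.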
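Consider

\[ f(x) = \begin{cases} e^x, & 0 \le x < 1, \\ 0, & \text{otherwise.} \end{cases} \]

Then
\[ f_0(x) = \sum_{k=-\infty}^{\infty} \alpha_k\, \varphi(x - k) = (e^1 - e^0)\varphi(x), \quad \text{and } f_0(x) \in V_0, \]
\[ f_{-1}(x) = \begin{cases} 2(e^{1/2} - 1), & 0 \le x < 1/2, \\ 2(e^1 - e^{1/2}), & 1/2 \le x < 1, \end{cases} \]

and $f_{-1}(x) \in V_{-1}$. Here, $f_0 \in V_0$ and $f_{-1} \in V_{-1}$ are approximations to
$f \in L^2(\mathbb{R})$.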
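
The approximations $f_0$ and $f_{-1}$ above, and the first-order convergence asserted for the Haar basis, can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: $f(x) = e^x$ on $[0,1)$ as in the example, the Haar scaling function of Example 4.19, and coefficients computed as cell averages from the antiderivative. The function names are illustrative, not from the text.

```python
import math

def haar_coeffs(F, j):
    """Coefficients alpha_{j,k} = 2^{-j} (f, phi(2^{-j} x - k)) for f supported on [0, 1).

    F is an antiderivative of f; j <= 0, so the cells have width h = 2^j <= 1.
    """
    h = 2.0 ** j
    n = round(1.0 / h)
    return [(F((k + 1) * h) - F(k * h)) / h for k in range(n)]

def sup_error(f, coeffs, j, samples=64):
    """Largest sampled |f - f_j| over [0, 1], using one-sided values at cell ends."""
    h = 2.0 ** j
    worst = 0.0
    for k, c in enumerate(coeffs):
        for i in range(samples + 1):
            worst = max(worst, abs(f((k + i / samples) * h) - c))
    return worst

f = math.exp      # f(x) = e^x on [0, 1)
F = math.exp      # an antiderivative of f

c0 = haar_coeffs(F, 0)      # single coefficient e - 1
c1 = haar_coeffs(F, -1)     # 2(e^{1/2} - 1) and 2(e - e^{1/2})
levels = (0, -1, -2, -3, -4, -5, -6)
errors = [sup_error(f, haar_coeffs(F, j), j) for j in levels]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(c0, c1)
print(ratios)               # the ratios approach 2 as j decreases
```

Halving the cell width roughly halves the maximum error, which is the behavior expected for $N = 1$.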


Example 4.27 
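$N = 2$ and $L = 3$.
We need to find $\varphi(x) = a_0\varphi(2x) + a_1\varphi(2x - 1) + a_2\varphi(2x - 2) + a_3\varphi(2x - 3)$.

(A) $a_0 - a_1 + a_2 - a_3 = 0$,
    $-a_1 + 2a_2 - 3a_3 = 0$,

(B) $\frac{1}{2}(a_0^2 + a_1^2 + a_2^2 + a_3^2) = 1$;
    $\frac{1}{2}(a_0 a_2 + a_1 a_3) = 0$.

Solving (A) and (B) gives:

\[ \varphi(x) = \frac{1}{4}\Bigl[ (1 + \sqrt{3})\varphi(2x) + (3 + \sqrt{3})\varphi(2x - 1) + (3 - \sqrt{3})\varphi(2x - 2) + (1 - \sqrt{3})\varphi(2x - 3) \Bigr]. \]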
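
As a quick check, the four coefficients just obtained can be substituted back into conditions (A) and (B). The sketch below is an illustration rather than part of the derivation; the variable names `cond_A` and `cond_B` are hypothetical.

```python
import math

s3 = math.sqrt(3.0)
# Coefficients a_0, ..., a_3 read off from the solution of (A) and (B) with N = 2, L = 3.
a = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]

# (A): sum_k (-1)^k a_k k^j = 0 for j = 0, 1  (Python evaluates 0**0 as 1).
cond_A = [sum((-1) ** k * a[k] * k ** j for k in range(4)) for j in range(2)]

# (B): (1/2) sum a_k^2 = 1 and (1/2)(a_0 a_2 + a_1 a_3) = 0.
cond_B = [0.5 * sum(x * x for x in a), 0.5 * (a[0] * a[2] + a[1] * a[3])]

print(cond_A)   # both entries are zero up to rounding
print(cond_B)   # approximately [1, 0]
```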


An iterative approach is employed to find $\varphi(x)$. (The above equation only
determines $\varphi(x)$ to within a constant factor.) In the iterative approach, $\varphi(x)$
is first found at $x = 1$ and $x = 2$. Specifically, since

\[ \varphi(x) = \sum_{k=0}^{3} a_k\, \varphi(2x - k) \quad \text{for } 0 \le x < 3 \]

and $\varphi(x)$ has support $[0,3)$,

\[ \varphi(j) = \sum_{k=0}^{3} a_k\, \varphi(2j - k) \]

is a linear equation in the unknowns $\varphi(j)$, $j = 1,2$. Also, $\varphi$ is normalized so
that

\[ \sum_{j=1}^{2N-2} \varphi(j) = 1. \]

This determines $\varphi(j)$ for $j = 1,2,\ldots,2N-2$. Values at half-integers, quarter-integers,
etc. then can be determined. For example,

\[ \varphi(j/2) = \sum_{k=0}^{2N-1} a_k\, \varphi\bigl(2(j/2) - k\bigr) = \sum_{k=0}^{2N-1} a_k\, \varphi(j - k). \]

The graph of such a $\varphi$ (called a Daubechies scaling function) is illustrated in
Figure 4.13.


FIGURE 4.13: Illustration of a Daubechies scaling function $\varphi(x)$.
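
The iterative procedure just described can be sketched in a few lines for the case $N = 2$, $L = 3$. This is a minimal illustration under stated assumptions, not the authors' code: it uses the coefficients $a_k = (1 \pm \sqrt{3})/4$, $(3 \pm \sqrt{3})/4$ from Example 4.27, the normalization $\varphi(1) + \varphi(2) = 1$, and exact dyadic abscissas stored as `Fraction` keys. The function name `scaling_values` is hypothetical.

```python
import math
from fractions import Fraction

s3 = math.sqrt(3.0)
a = [(1 + s3) / 4, (3 + s3) / 4, (3 - s3) / 4, (1 - s3) / 4]

def scaling_values(levels):
    """Values of phi at dyadic points of [0, 3], refined `levels` times."""
    # Integers: phi(1) = a0*phi(2) + a1*phi(1) together with phi(1) + phi(2) = 1.
    phi1 = a[0] / (1 - a[1] + a[0])
    phi = {Fraction(0): 0.0, Fraction(1): phi1,
           Fraction(2): 1.0 - phi1, Fraction(3): 0.0}
    for lev in range(1, levels + 1):
        h = Fraction(1, 2 ** lev)
        for i in range(1, 3 * 2 ** lev, 2):       # odd multiples of h: the new points
            x = i * h
            # 2x - k lies on the previous (coarser) grid; outside [0, 3] phi is zero.
            phi[x] = sum(a[k] * phi.get(2 * x - k, 0.0) for k in range(4))
    return phi

v = scaling_values(2)
for x in sorted(v):
    print(float(x), v[x])
```

Each refinement level only needs values from the coarser grid, so the table can be extended to any dyadic resolution; plotting the resulting points gives a picture like Figure 4.13.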


REMARK 4.60 Using several scaling functions simultaneously, wavelets
that are piecewise polynomials can be constructed. (See [3, pp. 197-200].)


In §3.3.8.4 (page 139), we considered least squares approximation from a 
finite point set in the context of linear algebra. We now revisit this subject 
within the applied and approximation theoretic context of this chapter. 
4.8 Least Squares Approximation on a Finite Point Set


In an earlier section of this chapter, we considered least squares approximation
to $f \in C[a,b]$. For example, we find $p_n \in P^n$ such that $\|f - p_n\|_2 \le
\|f - q_n\|_2$ for all $q_n \in P^n$. We now consider least squares approximation on a
finite point set.


Problem 4.1 Let $G$ be a subset of $\mathbb{R}$ and $x_i \in G$ for $i = 1,2,\ldots,N$. Given $N$
pairs of real numbers $(x_1, y_1), \ldots, (x_N, y_N)$ and $n \le N$ functions $u_1, u_2, \ldots, u_n$
defined on $G$, we seek a function of the form

\[ u(x) = \sum_{k=1}^{n} \lambda_k u_k(x), \]

which will approximate the values $y_1, y_2, \ldots, y_N$ at the points $x_1, x_2, \ldots, x_N$.
For example, if $u_1(x) = 1$, $u_2(x) = x$, then $u(x) = \lambda_1 + \lambda_2 x$ is a line, as is
illustrated in Figure 4.14. (Note that the $x_i$ need not be distinct.)


FIGURE 4.14: Graph of a possible linear least squares fit $u(x) = \lambda_1 + \lambda_2 x$.


Problem 4.1 is often called the linear least squares problem: Even though the
fitting function $u = \sum_k \lambda_k u_k$ may be nonlinear in $x$, $u$ is a linear combination of the
$u_k$ with coefficients $\lambda_k$, and the $\lambda_k$ can therefore be obtained with techniques from linear algebra.


REMARK 4.61 We can extend the problem to $G$, a subset of $\mathbb{R}^m$ (or
$\mathbb{C}^m$), where $x_i \in G$ for $i = 1,2,\ldots,N$. We then seek $u(x) = \sum_{k=1}^{n} \lambda_k u_k(x)$ for
$x \in G$.
REMARK 4.62 The functions $u_1, u_2, \ldots, u_n$ considered in this section
are powers of $x$, i.e., $u_i(x) = x^{i-1}$, $i = 1,2,\ldots,n$. However, other choices,
such as $u_1(x) = 1$, $u_{2\ell}(x) = \sin(\ell x)$, $u_{2\ell+1}(x) = \cos(\ell x)$, $\ell = 1, 2, \ldots, m$ and
$n = 2m+1$, are often used. The approximation in this case is referred to as
a discrete harmonic approximation.


Least Squares Solution The least squares solution to the problem is to find
$\lambda_1, \lambda_2, \ldots, \lambda_n$ so that the sum

\[ \sum_{j=1}^{N} \Bigl( y_j - \sum_{k=1}^{n} \lambda_k u_k(x_j) \Bigr)^2 \tag{4.60} \]

is minimized.


REMARK 4.63 If $n = 2$, $u_1(x) = 1$, $u_2(x) = x$, the problem reduces to
the familiar problem of fitting a line to the data points, i.e., finding $\lambda_1$ and
$\lambda_2$ such that $\sum_{j=1}^{N} \bigl(y_j - (\lambda_1 + \lambda_2 x_j)\bigr)^2$ is minimized.


Now, let $y = (y_1, y_2, \ldots, y_N)^T \in \mathbb{R}^N$ and
\[ u_k = (u_k(x_1), u_k(x_2), \ldots, u_k(x_N))^T \in \mathbb{R}^N \]
for $k = 1,2,\ldots,n$. Then (4.60) becomes the problem of minimizing $z^T z$,
where $z = y - w$ and $w = \sum_{k=1}^{n} \lambda_k u_k$. Thus, we identify Problem 4.1 with
the following problem: Given the vector space $V = \mathbb{R}^N$ and a subspace $W$ of $V$
defined by $W = \mathrm{span}(u_1, u_2, \ldots, u_n)$, find $w \in W$ such that

\[ \|y - w\|^2 = (y - w)^T (y - w) \le \|y - u\|^2 \]

for all $u \in W$. Thus, $w$ is just the best approximation to $y$ in the inner product
space $\mathbb{R}^N$, and the $\lambda_k$ satisfy the linear system:

\[ \sum_{k=1}^{n} \lambda_k (u_k, u_j) = (y, u_j), \quad \text{for } 1 \le j \le n. \tag{4.61} \]

Hence, $w = \sum_{k=1}^{n} \lambda_k u_k$ approximates $y$, and $w(x) = \sum_{k=1}^{n} \lambda_k u_k(x)$ is the solution
to the least squares problem.
To find the $\lambda_k$ with (4.61), it is assumed that the $u_k$, $1 \le k \le n$, are linearly
independent vectors in $\mathbb{R}^N$. Thus, the functions $u_1(x), u_2(x), \ldots, u_n(x)$ must
be linearly independent on the subset $\{x_1, x_2, \ldots, x_N\}$ of $G$.
Equations (4.61) can be put into the form

\[ A^T A \lambda = A^T y, \tag{4.62} \]

where $A = (u_1, u_2, \ldots, u_n)$ is an $N \times n$ matrix. ($A^T A$ is $n \times n$.) Note that
this is precisely the form (3.29) on page 140, where we derived it from the
perspective of an overdetermined linear system of equations.


REMARK 4.64 $A^T A$ is nonsingular (in fact, positive definite) if and
only if $u_1, u_2, \ldots, u_n$ are linearly independent. To see this, consider

\[ x^T A^T A x = \sum_{i,j=1}^{n} (x_i u_i)^T (x_j u_j) = \Bigl( \sum_{i=1}^{n} x_i u_i, \; \sum_{j=1}^{n} x_j u_j \Bigr). \]

$A^T A$ is singular if and only if $x^T A^T A x = 0$ for some $x \ne 0$. But $x^T A^T A x = 0$
if and only if $\sum_{i=1}^{n} x_i u_i = 0$. However, for $x \ne 0$, $\sum_{i=1}^{n} x_i u_i = 0$ if and only if $u_1$,
$u_2, \ldots, u_n$ are linearly dependent. Thus, $A^T A$ is singular if and only if $u_1$,
$u_2, \ldots, u_n$ are linearly dependent.


The above discussion is summarized in the following theorem. 


THEOREM 4.22 


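Suppose $G$ is a subset of $\mathbb{R}$ and $x_i \in G$ for $i = 1, 2, \ldots, N$. Given $N$ pairs of
real numbers $(x_1, y_1), \ldots, (x_N, y_N)$, if $u_1(x), u_2(x), \ldots, u_n(x)$ are linearly
independent on $\{x_1, x_2, \ldots, x_N\}$, then the least squares solution

\[ w(x) = \sum_{k=1}^{n} \lambda_k u_k(x) \]
which minimizes

\[ \sum_{j=1}^{N} \bigl(y_j - w(x_j)\bigr)^2 \]

is unique and is given by the solution to $A^T A \lambda = A^T y$, where
\[ A = (u_1 \; u_2 \; \ldots \; u_n), \qquad \lambda = (\lambda_1, \lambda_2, \ldots, \lambda_n)^T, \qquad y = (y_1, y_2, \ldots, y_N)^T, \]
and
\[ u_k = (u_k(x_1), \ldots, u_k(x_N))^T. \]

(Note that $A$ is $N \times n$ and $A^T A$ is $n \times n$.)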


REMARK 4.65 The so-called smoothing polynomials are obtained by
choosing $u_i(x) = x^{i-1}$ for $i = 1,2,\ldots,n$.

By Theorem 4.22, if $u_1(x), u_2(x), \ldots, u_n(x)$ are linearly independent on
$\{x_1, x_2, \ldots, x_N\} \subset G$, then the least squares solution exists and is unique. We
have


PROPOSITION 4.13 


Suppose that at least $n \le N$ of the points $x_1, x_2, \ldots, x_N$ are pairwise distinct.
Then $u_1, u_2, \ldots, u_n$ are linearly independent on $G \subset \mathbb{R}$, where $u_i = x^{i-1}$,
$i = 1, 2, \ldots, n$. (Thus, the least squares solution exists and is unique by
Theorem 4.22.)


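PROOF Let $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\} \subset \{x_1, x_2, \ldots, x_N\}$ be $n$ distinct points on
$G$. Consider

\[ c_1 u_1(\hat{x}_i) + c_2 u_2(\hat{x}_i) + \cdots + c_n u_n(\hat{x}_i) = 0 \]

for $i = 1,2,\ldots,n$. We need to show that $c_1 = c_2 = \cdots = c_n = 0$ or equivalently
that $c = 0$, where $Bc = 0$ and

\[ Bc = \begin{pmatrix} u_1(\hat{x}_1) & u_2(\hat{x}_1) & \cdots & u_n(\hat{x}_1) \\ \vdots & \vdots & & \vdots \\ u_1(\hat{x}_n) & u_2(\hat{x}_n) & \cdots & u_n(\hat{x}_n) \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}. \]
But
\[ \det B = \det \begin{pmatrix} 1 & \hat{x}_1 & \cdots & \hat{x}_1^{\,n-1} \\ 1 & \hat{x}_2 & \cdots & \hat{x}_2^{\,n-1} \\ \vdots & \vdots & & \vdots \\ 1 & \hat{x}_n & \cdots & \hat{x}_n^{\,n-1} \end{pmatrix} = \prod_{j<k} (\hat{x}_k - \hat{x}_j) \ne 0, \]
since $\hat{x}_j \ne \hat{x}_k$ for $j \ne k$. Thus, $c = 0$, and $u_1, \ldots, u_n$ are linearly independent.

Suppose now that the set $\{x_1, x_2, \ldots, x_N\}$ has at least $n$ distinct values.
Then
\[ A = \begin{pmatrix} 1 & x_1 & \cdots & x_1^{n-1} \\ 1 & x_2 & \cdots & x_2^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_N & \cdots & x_N^{n-1} \end{pmatrix} \]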
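and $A^T A \lambda = A^T y$ has the form

\[ \begin{pmatrix} N & \sum_{i=1}^{N} x_i & \cdots & \sum_{i=1}^{N} x_i^{n-1} \\ \sum_{i=1}^{N} x_i & \sum_{i=1}^{N} x_i^2 & \cdots & \sum_{i=1}^{N} x_i^{n} \\ \vdots & \vdots & & \vdots \\ \sum_{i=1}^{N} x_i^{n-1} & \sum_{i=1}^{N} x_i^{n} & \cdots & \sum_{i=1}^{N} x_i^{2n-2} \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{N} y_i \\ \sum_{i=1}^{N} x_i y_i \\ \vdots \\ \sum_{i=1}^{N} x_i^{n-1} y_i \end{pmatrix}, \]

where $w(x) = \sum_{i=1}^{n} \lambda_i x^{i-1}$ is the smoothing polynomial that passes near the
points $(x_1, y_1), \ldots, (x_N, y_N)$.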


Example 4.28 

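Find the least squares parabolic fit $w(x) = \lambda_1 + \lambda_2 x + \lambda_3 x^2$ for the points
$(0,0)$, $(1,2)$, $(2,3)$, $(3,10)$.

Solution: For this example, $N = 4$ and $n = 3$. We have the table:

     i  | x_i | y_i | x_i^2 | x_i^3 | x_i^4 | x_i y_i | x_i^2 y_i
     1  |  0  |  0  |   0   |   0   |   0   |    0    |     0
     2  |  1  |  2  |   1   |   1   |   1   |    2    |     2
     3  |  2  |  3  |   4   |   8   |  16   |    6    |    12
     4  |  3  | 10  |   9   |  27   |  81   |   30    |    90
   sums |  6  | 15  |  14   |  36   |  98   |   38    |   104

Thus, $A^T A \lambda = A^T y$ has the form

\[ \begin{pmatrix} 4 & 6 & 14 \\ 6 & 14 & 36 \\ 14 & 36 & 98 \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \end{pmatrix} = \begin{pmatrix} 15 \\ 38 \\ 104 \end{pmatrix}. \]

Solving, $\lambda_1 = 0.35$, $\lambda_2 = -0.65$, $\lambda_3 = 1.25$, and
\[ w(x) = 0.35 - 0.65x + 1.25x^2 \]

is the least squares parabolic fit to the points.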
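
The hand computation in Example 4.28 can be reproduced exactly in rational arithmetic. The following is a small sketch, assuming the monomial basis $u_i(x) = x^{i-1}$; the helper names `normal_equations` and `solve_exact` are illustrative and not from the text.

```python
from fractions import Fraction

def normal_equations(xs, ys, n):
    """Return (G, r) with G = A^T A and r = A^T y for the basis 1, x, ..., x^(n-1)."""
    G = [[sum(Fraction(x) ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    r = [sum(Fraction(x) ** i * y for x, y in zip(xs, ys)) for i in range(n)]
    return G, r

def solve_exact(G, r):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(G)
    M = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(G, r)]
    for c in range(n):
        p = next(i for i in range(c, n) if M[i][c] != 0)   # nonzero pivot
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for i in range(n):
            if i != c:
                factor = M[i][c]
                M[i] = [vi - factor * vc for vi, vc in zip(M[i], M[c])]
    return [M[i][n] for i in range(n)]

xs = [0, 1, 2, 3]
ys = [0, 2, 3, 10]
G, r = normal_equations(xs, ys, 3)
lam = solve_exact(G, r)
print(G, r)     # the sums from the table
print(lam)      # 7/20, -13/20, 5/4, i.e. 0.35, -0.65, 1.25
```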


REMARK 4.66 The normal equations, as in Example 4.28, are used
mainly for theoretical purposes, illustration, and hand computations. Modern
software uses techniques such as the QR-decomposition (§3.3.8 of this
book), because $A^T A$ can be ill-conditioned: its 2-norm condition number is the
square of that of $A$. Computing the QR decomposition (or
the singular value decomposition of §3.5) is a more stable computation than
solving the system of normal equations.
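
To illustrate the alternative mentioned in Remark 4.66, the sketch below solves the data of Example 4.28 through a thin QR factorization instead of the normal equations. It is a minimal illustration under stated assumptions: modified Gram-Schmidt is used here only because it is short, whereas library routines typically rely on Householder reflections; the function names are hypothetical.

```python
import math

def mgs_qr(cols):
    """Thin QR of the matrix with columns `cols`, by modified Gram-Schmidt.

    Returns (Q, R) with Q stored as a list of orthonormal columns.
    """
    n = len(cols)
    Q = [list(c) for c in cols]
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = math.sqrt(sum(v * v for v in Q[i]))
        Q[i] = [v / R[i][i] for v in Q[i]]
        for j in range(i + 1, n):
            R[i][j] = sum(qi * qj for qi, qj in zip(Q[i], Q[j]))
            Q[j] = [qj - R[i][j] * qi for qi, qj in zip(Q[i], Q[j])]
    return Q, R

def lstsq_qr(cols, y):
    """Least squares coefficients from R * lam = Q^T y."""
    Q, R = mgs_qr(cols)
    n = len(cols)
    c = [sum(q * yi for q, yi in zip(Q[i], y)) for i in range(n)]   # Q^T y
    lam = [0.0] * n
    for i in reversed(range(n)):                                     # back substitution
        lam[i] = (c[i] - sum(R[i][j] * lam[j] for j in range(i + 1, n))) / R[i][i]
    return lam

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 3.0, 10.0]
cols = [[x ** k for x in xs] for k in range(3)]   # columns 1, x, x^2 of A
lam = lstsq_qr(cols, ys)
print(lam)    # close to [0.35, -0.65, 1.25]
```

The matrix $A^T A$ is never formed, so the conditioning of the computation is governed by $A$ itself.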
4.9 Exercises

1. For $f(t) = \sin(t)$,

   (a) compute the coefficients of the degree 3 Taylor polynomial approximation
       at $t = 0$;

   (b) compute the coefficients of the polynomial $p_3(t)$ that minimizes
       \[ \int_{t=-1}^{1} \bigl(f(t) - p_3(t)\bigr)^2\,dt; \]

   (c) compute the coefficients of the degree 3 polynomial that interpolates
       $f$ at $t = -1$, $t = -1/3$, $t = 1/3$, and $t = 1$.

   (d) Rewrite each of the degree-3 polynomials in 1a, 1b, and 1c in terms
       of the basis $\varphi_0 = 1$, $\varphi_1 = t$, $\varphi_2 = t^2$, and $\varphi_3 = t^3$, then compare
       coefficients.

   (e) Estimate the infinity norm of the error, for each of the approximations
       in 1a, 1b, and 1c.


2. Find the least squares approximation for the function $f(x) = e^x$ by a
   polynomial $p_1(x) = b_0 + b_1 x$ on $[-1, 1]$ using the norm

   \[ \|f\|^2 = \int_{-1}^{1} (f(x))^2\,dx. \]


3. Let $f \in C[-1,3]$ and let $P^n = \mathrm{span}\{\varphi_0, \varphi_1, \ldots, \varphi_n\}$, with $\{\varphi_j\}_{j=0}^{n}$
   orthonormal. Let $g^* \in P^n$ be the least squares approximation to $f$, i.e.,
   for any $p \in P^n$, $\|g^* - f\|^2 \le \|p - f\|^2$, where $\|f\|^2 = \int_{-1}^{3} (f(x))^2\,dx$.
   Prove that
   \[ \int_{-1}^{3} \bigl(f(x) - g^*(x)\bigr) p(x)\,dx = 0 \quad \text{for all } p \in P^n. \]


4. Use the Gram-Schmidt orthogonalization process to compute the first three
   polynomials $q_0(x)$, $q_1(x)$, and $q_2(x)$ which are orthogonal on the interval
   $[0,1]$ with respect to the weight function $w(x) = 1$. Using these
   polynomials, obtain the least squares approximation of second degree
   for $f(x) = x^{?}$ on $[0,1]$.


5. Consider Runge's function:
   \[ f(x) = \frac{1}{1 + x^2}. \]

   (a) Compute and graph the interpolating polynomials (with a graph
       of Runge's function itself) using 5, 9, and 17 equally spaced points
       in the interval $[-5,5]$. What do you find?

   (b) Use the error formula for polynomial interpolation to estimate the
       error in each of the interpolants in part 5a.

   (c) Repeat part 5a, except using Chebyshev points instead of equally
       spaced points.

   (d) Use the error formula for polynomial interpolation to estimate the
       error in each of the interpolants in part 5c.


6. Let $P_1$ be the polynomial of degree 1 interpolating $f(x)$
   at the points $x_0$, $x_1$.

   (a) Show that $P_1$ is unique. Moreover, if $f \in C^2[x_0, x_1]$, show that the
       error in the linear interpolant satisfies

       \[ f(x) - P_1(x) = \frac{1}{2}(x - x_0)(x - x_1) f''(\xi(x)), \quad x_0 < x < x_1. \]

   (b) Find the function $\xi(x)$ explicitly for $f(x) = 1/x$, $x_0 = 1$, and
       $x_1 = 2$. Furthermore, find $\max_{1 \le x \le 2} \xi(x)$ and $\min_{1 \le x \le 2} \xi(x)$.


7. Show that, in the setting of Remark 4.15 (on page 218), 


Woo = 2-2" — a)" 


8. Show that the matrix A in the system (4.21) on page 222 is positive 
definite. 


9. (Literature search question) What is the constant c, in Remark 4.27 on 
page 233? What happens if $f \in C[a, b]$?


10. Repeat the computations for Example 4.10 on page 237, except use the 
interpolating polynomial at the six Chebyshev points, rather than at six 
equally spaced points. 


11. Let $f \in C[0, 1]$ and let $\sum_{i=1}^{n} w_i f(x_i)$ be an approximation to $\int_0^1 f(x)\,dx$.
    (The $x_i$ are called points and the $w_i$ are called weights.) Assume that
    $0 \le w_i \le 1$ and $0 \le x_i \le 1$ for $i = 1,2,\ldots,n$, and assume $\sum_{i=1}^{n} w_i = 1$
    for any $n$. Let

    \[ E_n(f) = \int_0^1 f(x)\,dx - \sum_{i=1}^{n} w_i f(x_i) \]

    be the error in the approximation. Assume that $E_n(P_n) = 0$ for $P_n$ any
    polynomial of degree $\le n$. Use the Weierstrass approximation theorem
    to prove that, given $\epsilon > 0$, there is an $N > 0$ such that $|E_n(f)| < \epsilon$
    when $n > N$.
12.


13. 


14. 


15. 


16. 


Let $f \in C^\infty[0,1]$ be such that

\[ \max_{0 \le x \le 1} |f^{(n)}(x)| \le K(n+2)! \quad \text{for } n = 1,2,\ldots \text{ and some } K > 0. \]
Let $x_i = i/n$, $i = 0,1,2,\ldots,n$, and let $P_n(x)$ interpolate $f(x)$ at $x = x_i$
for $i = 0,1,2,\ldots,n$. Prove that, given $\epsilon > 0$, there is an $N > 0$ such
that $\|f - P_n\|_\infty < \epsilon$ when $n > N$.


The polynomial P,,() interpolating the function f(x) at the nodes x, 
for k =0,...,n is given by 


n 


P,(«) = S > Le(2) f(tr), where L(x) = II “ ae 


k=0 i=0,i#k 


n 


Let ~(x) = [[@ — «;). Show that > LDx(x) = 1 and 


i=0 k=0 


Consider the function P3(2) = 23 — 9x? —20x+5. Find a second degree 
polynomial P2(xz) such that 6 = pmax, |Ps(2) — P,(x)| is as small as 


possible. 
Consider interpolating the function f(x,y) at the n? points (x;,y;) for 
i,j =1,2,...,n, where {x;}7_, and {yj}, are each pairwise distinct. 
Let 
a et “ ¥ — Yk 
L(x) = 5 l = ’ 
i(2) I moa, uw II =a 
mAi kAj 

n n 
and p(x,y) = $2 S> cijli(a)lj(y) 

i=1 j=1 


(a) Find c,; for 7,7 =1,...,n so that p(x, y) interpolates f(z, y) at the 
n? points. 


(b) Show that S$” S71,(2)l;(y) = 1. 
i=1 j=l 
Suppose the N;,(2) are as in (4.10) on page 215, and form the system of 
equations in the unknowns cx associated with the conditions 


{p(xi) = Pa yen 


17. 


18. 


19. 


20. 


21. 


22. 


23. 
with


p(z) = > cu Ne (a), 
k=0 


analogously to how we formed the Vandermonde system and the (trivial) 
system of equations associated with the Lagrange basis functions. 


(a) Show that the matrix for this system of equations is lower triangu- 
lar. 


(b) What are the cz? 


Let $x_0, x_1, \ldots, x_n$ be distinct real numbers, and consider the interpolation
problem $P_n(x) = \sum_{j=0}^{n} c_j e^{jx}$ such that $P_n(x_i) = y_i$, $i =
0,1,2,\ldots,n$, with $\{y_i\}_{i=0}^{n}$ arbitrary real values. Prove that there is
a unique choice for the coefficients $c_0, c_1, c_2, \ldots, c_n$.


Using Hermite interpolation, find the polynomial P:(a) that satisfies 
P2(0) = f(0), Po(2) = f(2) and Pi(2) = f’(2). Also estimate the error 
If(a) — Pa(x)| for f € C0, 1), 

Let x; = th fori =0,1,...,N be N +1 distinct points on [0,1], where 
h=1/N. Assume that f € C4[0,1]. Let H2(x) be the piecewise cubic 
Hermite interpolant to f(a) such that H2(x;) = f(a;) and HS(x;) = 
f' (ai) for i= 0,1,...,N. Prove that 


_§F rag ae 
max, f@) — Fal) S 7g 


Find the best uniform approximation (minimax approximation) $a_0 + a_1 x$
to $f(x) = \ln(1 + x)$ on the interval $[0,1]$. Note that

\[ \max_{0 \le x \le 1} |\ln(1 + x) - a_0 - a_1 x| \le \max_{0 \le x \le 1} |\ln(1 + x) - b_0 - b_1 x| \]

for any $b_0, b_1 \in \mathbb{R}$.


Suppose that f(a) is an even function on[—a,a]. Show that the best 
uniform approximation P*(x) to f(x) on [—a, a] is also even. 


Complete the computations in Example 4.10 on page 237. That is, 
approximate sin(x), « € [—0.1,0.1] by an interpolating polynomial of 
degree 5, and use interval arithmetic to obtain an interval bounding the 
exact solution of sin(0.05). Do it 


(a) with equally spaced points, and 
(b) with Chebyshev points. 


Prove Remark 4.36 (on page 243). 
26.


27. 


28. 


29. 


30. 


31. 
Show that A is as in (4.30) on page 244 (in the context of the proof of
Theorem 4.14). 


. Show that conditions (n) in Definition 4.17 of the natural spline inter- 


polant (on page 246) lead to the system (4.34) (on page 247). 


Let si(z) = 1+ c(a +1), -1 < x < 0, where c is a real number. 
Determine s2(z) on 0 < a <1 so that 


is a natural cubic spline, i.e. s’”(—1) = s”(1) = 0 on [—1, 1] with nodal 
points at —1, 0, 1. How must c be chosen if one wants s(1) = —1? 


Suppose that $f(x)$ satisfies a Lipschitz condition $|f(x) - f(y)| \le L|x - y|$
for all $x, y \in [0,1]$. Let $U(x)$ be a piecewise constant approximation to
$f(x)$ such that $U(x) = f(x_i)$ for $x_i \le x < x_{i+1}$, for $i = 0,1,\ldots,N-1$,
with $x_i = ih$ and $h = 1/N$. Prove that $\max_{0 \le x \le 1} |U(x) - f(x)| \le ch$ for
some constant $c > 0$.


Let a = to < ty <... < typ = b. We wish to determine a function 
S € Cla, b] such that 


(i) S(t) = f(t), i =0,1,...,n. 
(ii) S(t) = S,(t) for t € [t;-1,t,], where S; is a quadratic polynomial. 


(a) With hj = t; = tj-1, show that 


2 : 
j 
(b) If S’(to) = f’(to), find the linear system S’(t1),...,5’(tn) must 
satisfy. 


Prove Remark 4.43 on page 256. 


5 
2 


Let f(x) = (cosa —0.5)?. Let 


20 


k 
1px 1 1x 
gx(x) = Doi , where a= 5 : f(aje “* dx. 
j= 


Prove that there is a constant c > 0 such that ||gx — flo <¢/k. 


Let f € C?"+![-a,7]. Let f(x) = P__,, qe be a trigonometric 
approximation to f(a) such that f(0) = f(0) for 1 =0,1,2,...,2n. 
(f(a) can be considered a Taylor trigonometric approximation.) 


> 
(a) Show that $c_k$, $-n \le k \le n$, can be determined, so f(x) exists.


(b) Find f(a) for f(x) = /(«+3m) and n = 1. That is, find coef- 
ficients c_1, Co, C1 such that f(x) = c_1e7 +9 + ce. Then, 
write f(a) in terms of real trigonometric functions. 


32. Consider 


(a) Use the FFT procedure with N = 16 to compute the trigonometric 
interpolating polynomial to f. Arrange your computations in a 
table similar to the table on page 258. 


(b) Graph the trigonometric interpolating polynomial you have ob- 
tained. (You can use MATLAB, for example, to make the graph.) 


33. Let $f(x) = \ln(1 + x)$ (as in Example 4.15 on page 264) for $x > 0$, and
    consider possible interpolation at the points $x_1 = 0$, $x_2 = 0.05$, $x_3 = 0.1$,
    $x_4 = 0.15$ with an $L = 2$, $M = 1$ rational function.

    (a) If it is not possible to so interpolate $f$, then explain why.
    (b) If it is possible to so interpolate $f$, then find the coefficients.

    (c) If it is possible to so interpolate $f$, then use (4.47) (on page 266)
        to compute a bound on the interpolation error for $0 \le x \le 0.15$,
        then compare this bound to the error bound obtained for the Padé
        approximant in Example 4.15.

    (Hint: Let $p(x) = a_0 + a_1 x + a_2 x^2$ and $q(x) = b_0 + b_1 x$, then consider the
    homogeneous system of equations (4.48) with the additional condition
    $b_0 = 1$.)


34. Consider the function f(z) =| e* ds. 
0 


(a) Find the R(1,2) Padé approximant of f(z). 


(b) Estimate the maximum error in the above approximation for 0 < 
a<2. 


35. Obtain the $R(1, 2)$ rational approximation to the function $e^x$ of the form

    \[ \frac{a_0 + a_1 x}{1 + b_1 x + b_2 x^2}. \]


36. Show that the R(0,1) Padé approximant to f(x) = x does not exist. 


37. Find the R(0,1) Padé approximant to f(x) = 2+ 32. 
38. Prove the assertions in Example 4.19 (on page 268). That is, using
    (4.49) and (4.50), prove that, for $L = 1$, $a_0 = 1$, and $a_1 = 1$,

    (a) $\varphi$ is supported on $[0,1)$, and
    (b) $\varphi$ is constant on $[0, 1)$.


39. Fill in the details of why (4.59) on page 275 is true. 


Chapter 5 


Eigenvalue-Eigenvector Computation 


This chapter is concerned with the computation of eigenvalues and eigenvec- 
tors. Several good references for the material in this chapter are [49], [85], [97], 
and [101]. Before studying numerical methods for accomplishing this task, a 
few important definitions and results from matrix theory are presented (many 
without proofs). 

Computing the eigenvalues and eigenvectors of a matrix is inherently an
iterative computational problem. In particular, Niels Abel proved that general quintic
equations are insoluble by finite algebraic formulae. Since the eigenvalue
problem can be formulated as solution of an algebraic equation, i.e., $\det(A -
\lambda I) = 0$, Abel's result implies that eigenvalue computation requires iterative
numerical methods of solution for $n \times n$ matrices with $n > 4$.


5.1 Basic Results from Linear Algebra 


DEFINITION 5.1 Let $A$ be an $n \times n$ complex matrix and $x \in \mathbb{C}^n$. Then
$x$ is an eigenvector of $A$ corresponding to eigenvalue $\lambda$ if $x \ne 0$ and $Ax = \lambda x$.
(A vector $y$ such that $y^H A = \lambda y^H$ is called a left eigenvector, and, in general,
$x \ne y$.)

REMARK 5.1 $\lambda$ is an eigenvalue of $A$ if and only if $\det(A - \lambda I) = 0$.
Up to the factor $(-1)^n$, the determinant defines the characteristic polynomial
\[ (-1)^n \det(A - \lambda I) = \lambda^n + \alpha_{n-1}\lambda^{n-1} + \alpha_{n-2}\lambda^{n-2} + \cdots + \alpha_1 \lambda + \alpha_0. \]

Thus, $A$ has exactly $n$ eigenvalues, the roots of the above polynomial, in the
complex plane, counting multiplicities. The set of eigenvalues of $A$ is called
the spectrum of $A$. Recall the following.


(a) The spectral radius of $A$ is defined by

\[ \rho(A) = \max_{\lambda \text{ an eigenvalue of } A} |\lambda|. \]

(b) $|\lambda| \le \|A\|$ for any induced matrix norm $\|\cdot\|$ and any eigenvalue $\lambda$.


(c) $\|A\|_2 = \sqrt{\rho(A^H A)}$. If $A^H = A$ (that is, if $A$ is Hermitian), then $\|A\|_2 =
\rho(A)$.


DEFINITION 5.2 A square matrix A is called defective if it has an 
eigenvalue of multiplicity k having fewer than k linearly independent eigen- 
vectors. 


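For example, if

\[ A = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad \text{then } \lambda_1 = \lambda_2 = 1, \text{ but } x = c \begin{pmatrix} 1 \\ 0 \end{pmatrix} \]

is the only eigenvector, so $A$ is defective.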


PROPOSITION 5.1 

Let $A$ and $P$ be $n \times n$ matrices, with $P$ nonsingular. Then $\lambda$ is an eigenvalue
of $A$ with eigenvector $x$ if and only if $\lambda$ is an eigenvalue of $P^{-1}AP$ with
eigenvector $P^{-1}x$. ($P^{-1}AP$ is called a similarity transformation of $A$, and $A$
and $P^{-1}AP$ are called similar.)


PROPOSITION 5.2 
Let $\{x_i\}_{i=1}^{n}$ be eigenvectors of $A$ corresponding to distinct eigenvalues $\{\lambda_i\}_{i=1}^{n}$.
Then the vectors $\{x_i\}_{i=1}^{n}$ are linearly independent.


REMARK 5.2 If $A$ has $n$ different eigenvalues, then the $n$ eigenvectors
are linearly independent and thus form a basis for $\mathbb{C}^n$. (Note that $n$ different
eigenvalues is sufficient but not necessary for $\{x_i\}_{i=1}^{n}$ to form a basis. Consider
$A = I$ with eigenvectors $\{e_i\}_{i=1}^{n}$.)


PROPOSITION 5.3 

Let $A$ be an $n \times n$ complex matrix. Then $A$ is nondefective (that is, $A$ has
a complete set of eigenvectors) if and only if there is a nonsingular matrix $X$
such that $X^{-1}AX = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, where $\{\lambda_i\}_{i=1}^{n}$ are the eigenvalues
of $A$. The $i$-th column of $X$ is an eigenvector corresponding to $\lambda_i$ and the $i$-th
row of $X^{-1}$ is a left eigenvector corresponding to $\lambda_i$.


PROPOSITION 5.4

(Schur decomposition) Let $A$ be an $n \times n$ complex matrix. Then there exists
a unitary matrix $U$ ($U^H U = I$) such that $U^H A U$ is upper triangular with
diagonal elements $\lambda_1, \lambda_2, \ldots, \lambda_n$.
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We now have the following useful result: 


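THEOREM 5.1
(Gerschgorin's Circle Theorem) Let A be any $n \times n$ complex matrix. Then every eigenvalue of A lies in the union of the discs

$$\bigcup_{j=1}^{n} K_{\rho_j}(a_{jj}), \quad \text{where } K_{\rho_j}(a_{jj}) = \{ z \in \mathbb{C} : |z - a_{jj}| \le \rho_j \} \text{ for } j = 1, 2, \ldots, n,$$

and where the centers $a_{jj}$ are diagonal elements of A and the radii $\rho_j$ can be taken as:

$$\rho_j = \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}|, \tag{5.1}$$

(absolute sum of the elements of each row excluding the diagonal elements),

$$\rho_j = \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{kj}|, \tag{5.2}$$

(absolute column sums excluding diagonal elements), or

$$\rho_j = \rho = \Bigl( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} |a_{jk}|^2 \Bigr)^{1/2} \tag{5.3}$$

for $j = 1, 2, \ldots, n$.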
Example 5.1

$$A = \begin{pmatrix} 2 & 1 & \tfrac12 \\ -1 & -3i & 1 \\ 3 & -2 & -6 \end{pmatrix}$$

Using absolute row sums,

$$a_{11} = 2,\ \rho_1 = 3/2; \qquad a_{22} = -3i,\ \rho_2 = 2; \qquad a_{33} = -6,\ \rho_3 = 5.$$

The eigenvalues are in the union of these discs. For example, $\rho(A) \le 11$. Also, A is nonsingular, since no eigenvalue $\lambda$ can equal zero. (See Figure 5.1.)
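The row-sum discs of Theorem 5.1 are easy to check numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `gerschgorin_discs` is hypothetical and not from the text. It uses the matrix of Example 5.1 (with the (1,3) entry taken as 1/2, consistent with $\rho_1 = 3/2$) and checks that each computed eigenvalue lies in at least one disc.

```python
import numpy as np

def gerschgorin_discs(A):
    """Return (centers, radii) of the row-sum Gerschgorin discs, Eq. (5.1)."""
    A = np.asarray(A, dtype=complex)
    centers = np.diag(A)
    # Row sums of absolute values, excluding the diagonal entry.
    radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
    return centers, radii

A = np.array([[2, 1, 0.5],
              [-1, -3j, 1],
              [3, -2, -6]], dtype=complex)

centers, radii = gerschgorin_discs(A)
eigenvalues = np.linalg.eigvals(A)

# For each eigenvalue, test membership in the union of the discs.
in_union = [bool(np.any(np.abs(lam - centers) <= radii + 1e-12))
            for lam in eigenvalues]
```

Since zero lies outside all three discs, the same data also confirm the nonsingularity argument in the example.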


PROOF (of Theorem 5.1) Let $\lambda$ be any eigenvalue of A. We will show that $\lambda$ must lie in one of the circles.

FIGURE 5.1: Illustration of Gerschgorin discs for Example 5.1.


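Case 1 Suppose that $\lambda = a_{jj}$ for some j. Then the assertion is satisfied.

Case 2 Suppose that $\lambda \ne a_{jj}$ for any j, $j = 1, 2, \ldots, n$. Consider $\lambda I - D$ where D is the diagonal matrix $d_{jj} = a_{jj}$. $\lambda I - D$ is nonsingular and $(\lambda I - D)^{-1} = \mathrm{diag}(1/(\lambda - a_{jj}))$. Let w be an eigenvector of A corresponding to eigenvalue $\lambda$. Then $Aw = \lambda w$ and thus $(A - D)w = (\lambda I - D)w$, so $w = (\lambda I - D)^{-1}(A - D)w$.

Let $\|\cdot\|$ be a matrix norm compatible with a vector norm, i.e., $\|Ax\| \le \|A\|\|x\|$. In particular,

$$\|Ax\|_\infty \le \|A\|_\infty \|x\|_\infty, \qquad \|Ax\|_1 \le \|A\|_1 \|x\|_1, \qquad \|Ax\|_2 \le \|A\|_E \|x\|_2, \quad \text{or} \quad \|Ax\|_2 \le \|A\|_2 \|x\|_2.$$

Then,

$$\|w\| = \|(\lambda I - D)^{-1}(A - D)w\| \le \|(\lambda I - D)^{-1}(A - D)\| \, \|w\|.$$

Hence, $1 \le \|(\lambda I - D)^{-1}(A - D)\|$.

We now consider $\|\cdot\| = \|\cdot\|_\infty$, $\|\cdot\|_1$, and $\|\cdot\|_E$.

(A) $\|\cdot\| = \|\cdot\|_\infty$. We have:

$$1 \le \max_{1 \le j \le n} \sum_{\substack{k=1 \\ k \ne j}}^{n} \frac{|a_{jk}|}{|\lambda - a_{jj}|} = \max_{1 \le j \le n} \frac{\rho_j}{|\lambda - a_{jj}|}, \quad \text{using Eq. (5.1) for } \rho_j.$$

Thus, $|\lambda - a_{jj}| \le \rho_j$ for at least one j. Hence, $\lambda \in K_{\rho_j}(a_{jj})$ for some j.

(B) $\|\cdot\| = \|\cdot\|_1$. We have:

$$1 \le \max_{1 \le j \le n} \sum_{\substack{k=1 \\ k \ne j}}^{n} \frac{|a_{kj}|}{|\lambda - a_{jj}|} = \max_{1 \le j \le n} \frac{\rho_j}{|\lambda - a_{jj}|}, \quad \text{using Eq. (5.2) for } \rho_j.$$

(Here the 1-norm is applied to $(A - D)(\lambda I - D)^{-1}$, which has $(\lambda I - D)w \ne 0$ as an eigenvector with eigenvalue 1.) Thus, $|\lambda - a_{jj}| \le \rho_j$ for at least one j. Hence, $\lambda \in K_{\rho_j}(a_{jj})$ for some j.

(C) $\|\cdot\| = \|\cdot\|_E$. We have:

$$1 \le \Bigl( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} \frac{|a_{jk}|^2}{|\lambda - a_{jj}|^2} \Bigr)^{1/2} \le \frac{\rho}{\min_j |\lambda - a_{jj}|}, \quad \text{using Eq. (5.3) for } \rho.$$

Thus, $|\lambda - a_{jj}| \le \rho$ for at least one j. Hence, $\lambda \in K_{\rho}(a_{jj})$ for some j.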


REMARK 5.3 It can be shown that if S is the union of m discs $K_{\rho_j}$ such that S is disjoint from all the other discs, then S contains precisely m eigenvalues. (See [68].)

We now consider some results for the special case when matrix A is Hermitian. Recall that, if $A^H = A$, then A is called Hermitian. (A real symmetric matrix is a special kind of Hermitian matrix.)


THEOREM 5.2
Let A be Hermitian (or real symmetric). The eigenvalues of A are real, and there is an orthonormal system of eigenvectors $w_1, w_2, \ldots, w_n$ of A with $Aw_j = \lambda_j w_j$ and $(w_j, w_k) = w_k^H w_j = \delta_{jk}$.

REMARK 5.4 The orthonormal system is linearly independent and spans $\mathbb{C}^n$, and thus forms a basis for $\mathbb{C}^n$. Thus, any vector $x \in \mathbb{C}^n$ can be expressed as

$$x = \sum_{j=1}^{n} \alpha_j w_j, \quad \text{where } \alpha_j = (x, w_j) \text{ and } \|x\|_2^2 = \sum_{j=1}^{n} |\alpha_j|^2.$$

We have two more interesting results concerning eigenvalues of Hermitian matrices.

THEOREM 5.3
Let A and B be Hermitian matrices. Then the eigenvalues $\lambda_j(A)$, $\lambda_j(B)$, $j = 1, 2, \ldots, n$, arranged in the order $\lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A)$, $\lambda_1(B) \ge \lambda_2(B) \ge \cdots \ge \lambda_n(B)$, satisfy the inequality

$$|\lambda_j(A) - \lambda_j(B)| \le \|A - B\|, \quad j = 1, 2, \ldots, n,$$

for any matrix norm compatible with a vector norm, i.e., $\|Ax\| \le \|A\|\|x\|$.

PROOF (See [90].)


REMARK 5.5 __ If the elements of A and B are close, then Theorem 5.3 
asserts that the eigenvalues are close. 


We now have a modified form of Gerschgorin's Theorem for Hermitian matrices.

COROLLARY 5.1
(A Gerschgorin Circle Theorem for Hermitian matrices) If A is Hermitian or real symmetric, then a one-to-one correspondence can be set up$^1$ between each disc $K_\rho(a_{jj})$ and each $\lambda_j$ from the spectrum $\lambda_1, \lambda_2, \ldots, \lambda_n$ of A, where

$$\rho = \max_j \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}| \quad \text{or} \quad \rho = \Bigl( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} |a_{jk}|^2 \Bigr)^{1/2}.$$

(Recall that $K_\rho(a_{jj}) = \{ z \in \mathbb{C} : |z - a_{jj}| \le \rho \}$.)

PROOF Let $D = \mathrm{diag}(a_{jj})$. The eigenvalues of D are the diagonal elements $a_{11}, a_{22}, \ldots, a_{nn}$, which are real since A is Hermitian. Let $a'_{11}, \ldots, a'_{nn}$ be a permutation of $a_{11}, a_{22}, \ldots, a_{nn}$ so that $a'_{11} \ge a'_{22} \ge \cdots \ge a'_{nn}$. Using Theorem 5.3, we have $|\lambda_j(A) - a'_{jj}| \le \|A - D\|_\alpha$ for $j = 1, 2, \ldots, n$, where $\alpha = E$ or $\infty$. (Note that $\alpha = 1$ yields, by symmetry, the same result as $\alpha = \infty$.)

Letting $\alpha = \infty$, we have

$$|\lambda_j(A) - a'_{jj}| \le \rho = \max_j \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}|, \quad j = 1, 2, \ldots, n.$$

Letting $\alpha = E$, we have

$$|\lambda_j(A) - a'_{jj}| \le \rho = \Bigl( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} |a_{jk}|^2 \Bigr)^{1/2}, \quad j = 1, 2, \ldots, n.$$

Thus, eigenvalue $\lambda_j(A)$ lies in the closed interval $K_\rho(a'_{jj}) = [a'_{jj} - \rho,\ a'_{jj} + \rho]$.

$^1$ This does not necessarily mean that there is only one eigenvalue in each disc; consider the matrix $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.


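Example 5.2

$$A = \begin{pmatrix} 3 & -1 & 1 \\ -1 & 15 & 1 \\ 1 & 1 & 10 \end{pmatrix}$$

We have $a'_{11} = 15$, $a'_{22} = 10$, $a'_{33} = 3$,

$$\max_j \sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}| = 2, \quad \text{and} \quad \Bigl( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} |a_{jk}|^2 \Bigr)^{1/2} = 6^{1/2},$$

so we may use $\rho = 2$. Thus, $13 \le \lambda_1 \le 17$, $8 \le \lambda_2 \le 12$, $1 \le \lambda_3 \le 5$.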
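The intervals in Example 5.2 can be checked directly. The sketch below assumes NumPy; the variable names are illustrative only. It computes both candidate radii of Corollary 5.1, sorts the diagonal entries and the eigenvalues in decreasing order, and compares them pairwise.

```python
import numpy as np

A = np.array([[3.0, -1.0, 1.0],
              [-1.0, 15.0, 1.0],
              [1.0, 1.0, 10.0]])

off = A - np.diag(np.diag(A))                   # off-diagonal part A - D
rho_inf = np.max(np.sum(np.abs(off), axis=1))   # max absolute row sum
rho_E = np.sqrt(np.sum(off ** 2))               # Euclidean (Frobenius) norm
rho = min(rho_inf, rho_E)                       # use the smaller radius

diag_sorted = np.sort(np.diag(A))[::-1]             # a'_11 >= a'_22 >= a'_33
eig_sorted = np.sort(np.linalg.eigvalsh(A))[::-1]   # lambda_1 >= ... >= lambda_3
gaps = np.abs(eig_sorted - diag_sorted)             # |lambda_j - a'_jj|
```

Each entry of `gaps` should not exceed `rho`, which is the statement of the corollary for this matrix.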


5.2 The Power Method 


In this section, we describe a simple iterative method for computing the eigenvector corresponding to the largest (in modulus) eigenvalue of a matrix A. We assume first that A is nondefective, i.e., A has a complete set of eigenvectors, and A has a unique simple$^2$ dominant eigenvalue. We will discuss more general cases later.

Specifically, suppose that the $n \times n$ matrix A has a complete set of eigenvectors $\{x_j\}_{j=1}^n$ corresponding to eigenvalues $\{\lambda_j\}_{j=1}^n$, and the eigenvalues satisfy

$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|. \tag{5.4}$$

Since the eigenvectors $\{x_j\}$ are linearly independent, they form a basis for $\mathbb{C}^n$. That is, any vector $q^{(0)} \in \mathbb{C}^n$ can be written

$$q^{(0)} = \sum_{j=1}^{n} c_j x_j \tag{5.5}$$

for some coefficients $c_j$. Starting with initial guess $q^{(0)}$, we define the sequence $\{q^{(\nu)}\}_{\nu \ge 1}$ by

$$q^{(\nu+1)} = \frac{1}{\sigma_{\nu+1}} A q^{(\nu)}, \quad \nu = 0, 1, 2, \ldots \tag{5.6}$$

where the sequence $\{\sigma_\nu\}_{\nu \ge 1}$ consists of scale factors chosen to avoid overflow and underflow errors. From (5.5) and (5.6), we have

$$q^{(\nu)} = \lambda_1^\nu \Bigl( \prod_{j=1}^{\nu} \sigma_j^{-1} \Bigr) \Bigl[ c_1 x_1 + \sum_{j=2}^{n} \Bigl( \frac{\lambda_j}{\lambda_1} \Bigr)^{\nu} c_j x_j \Bigr]. \tag{5.7}$$

Since by (5.4), $|\lambda_j / \lambda_1| < 1$ for $j \ge 2$, we have $\lim_{\nu \to \infty} (\lambda_j / \lambda_1)^\nu = 0$ for $j \ge 2$, and if $c_1 \ne 0$,

$$\lim_{\nu \to \infty} q^{(\nu)} = \lim_{\nu \to \infty} \Bigl[ \lambda_1^\nu \prod_{j=1}^{\nu} \sigma_j^{-1} \Bigr] c_1 x_1. \tag{5.8}$$

The scale factors $\sigma_\nu$ are usually chosen so that $\|q^{(\nu)}\|_\infty = 1$ or $\|q^{(\nu)}\|_2 = 1$ for $\nu = 1, 2, 3, \ldots$, i.e., the vector $q^{(\nu)}$ is normalized to have unit norm; thus $\sigma_{\nu+1} = \|Aq^{(\nu)}\|_\infty$ or $\|Aq^{(\nu)}\|_2$, since $q^{(\nu+1)} = Aq^{(\nu)} / \sigma_{\nu+1}$. With either normalization, the limit in (5.8) exists; in fact,

$$\lim_{\nu \to \infty} q^{(\nu)} = \frac{x_1}{\|x_1\|}, \tag{5.9}$$

i.e., the sequence $q^{(\nu)}$ converges, if $c_1 \ne 0$, to an eigenvector of unit length corresponding to the dominant eigenvalue of A.

$^2$ A simple eigenvalue is an eigenvalue corresponding to a root of multiplicity 1 of the characteristic equation.

REMARK 5.6 If $q^{(0)}$ is chosen randomly, the probability that $c_1 \ne 0$ is close to one, but not one. However, even if the exact $q^{(0)}$ happens to have been chosen with $c_1 = 0$, rounding errors on the computer may still result in a component in direction $x_1$.


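Example 5.3

$$A = \begin{pmatrix} -4 & 14 & 0 \\ -5 & 13 & 0 \\ -1 & 0 & 2 \end{pmatrix}$$

$\lambda_1 = 6$, $\lambda_2 = 3$, $\lambda_3 = 2$. (A is nondefective because all the eigenvalues are distinct.) Let $q^{(0)} = (1\ 1\ 1)^T$. Then,

$$q^{(1)} = \frac{1}{\sigma_1} A q^{(0)} = \frac{1}{10} (10\ \ 8\ \ 1)^T = (1\ \ 0.8\ \ 0.1)^T \quad (\text{choosing } \sigma_{\nu+1} = \|Aq^{(\nu)}\|_\infty),$$

$$q^{(2)} = \frac{1}{\sigma_2} A q^{(1)} = (1\ \ 0.75\ \ {-0.111})^T,$$

$$q^{(3)} = \frac{1}{\sigma_3} A q^{(2)} = (1\ \ 0.731\ \ {-0.188})^T,$$

$$q^{(4)} = (1\ \ 0.722\ \ {-0.221})^T, \qquad q^{(5)} = (1\ \ 0.718\ \ {-0.235})^T,$$

and two successive later iterates agree to three decimal places, both being $(1\ \ 0.714\ \ {-0.250})^T$.

Hence, the eigenvector corresponding to $\lambda_1 = 6$ is approximately $(1\ \ 0.714\ \ {-0.250})^T$.

(Of course, as described later, we also obtain an approximation to $\lambda_1$.)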
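The iteration (5.6) with infinity-norm scaling can be sketched in a few lines. This is a minimal illustration assuming NumPy, not a robust implementation; the function name `power_method` and its fixed iteration count are choices made here. It is applied to the matrix of Example 5.3 with $q^{(0)} = (1\ 1\ 1)^T$.

```python
import numpy as np

def power_method(A, q0, num_iter=50):
    """Power iteration with sigma_{nu+1} = ||A q^{(nu)}||_inf.

    Returns the last component-ratio eigenvalue estimate and the final iterate.
    """
    q = np.asarray(q0, dtype=float)
    mu = 0.0
    for _ in range(num_iter):
        z = A @ q
        k = np.argmax(np.abs(q))     # index of the largest component of q
        mu = z[k] / q[k]             # ratio of k-th components of A q and q
        sigma = np.max(np.abs(z))    # scale factor
        q = z / sigma
    return mu, q

A = np.array([[-4.0, 14.0, 0.0],
              [-5.0, 13.0, 0.0],
              [-1.0, 0.0, 2.0]])

mu1, q1 = power_method(A, [1, 1, 1], num_iter=1)   # first iterate only
mu, q = power_method(A, [1, 1, 1], num_iter=50)
```

With this scaling the component of largest modulus of each iterate is 1 in this example, so the iterates can be compared directly with the values listed above.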


Consider again Eq. (5.7). Since by assumption $\lambda_2$ is the eigenvalue of second largest absolute magnitude, we see that, for $\nu$ sufficiently large,

$$q^{(\nu)} \approx \Bigl( \lambda_1^\nu \prod_{i=1}^{\nu} \sigma_i^{-1} \Bigr) \Bigl[ c_1 x_1 + \Bigl( \frac{\lambda_2}{\lambda_1} \Bigr)^{\nu} k \Bigr],$$

where k is a constant vector. Hence,

$$\Bigl\| q^{(\nu)} - \frac{x_1}{\|x_1\|} \Bigr\| = O\Bigl( \Bigl| \frac{\lambda_2}{\lambda_1} \Bigr|^{\nu} \Bigr) \quad \text{as } \nu \to \infty. \tag{5.10}$$

That is, the rate of convergence of the sequence $q^{(\nu)}$ to the exact eigenvector is governed by the ratio $|\lambda_2 / \lambda_1|$. In practice, this ratio may be too close to 1, yielding a slow convergence rate. For instance, if $|\lambda_2 / \lambda_1| = 0.95$, then $|\lambda_2 / \lambda_1|^\nu < 0.1$ only for $\nu > 44$; that is, it takes over 44 iterations to reduce the error in (5.10) by a factor of 10. On the other hand, the method is very simple, requiring $n^2$ multiplications to compute $Aq^{(\nu)}$ at each iteration. Also, if A is sparse, the work is reduced, and only the nonzero elements of A need to be stored.
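The iteration count quoted above follows from solving $r^\nu \le 0.1$ for $\nu$. A small sketch, using only the standard library (the function name is hypothetical):

```python
import math

def iterations_for_factor(ratio, factor=0.1):
    """Smallest integer nu with ratio**nu <= factor, for 0 < ratio < 1."""
    return math.ceil(math.log(factor) / math.log(ratio))

nu_power = iterations_for_factor(0.95)   # convergence ratio 0.95
nu_fast = iterations_for_factor(0.25)    # a much smaller ratio, for comparison
```

For a ratio of 0.95 this gives 45, i.e., more than 44 iterations per decimal digit gained.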

Now having obtained a sequence of vectors converging to $x_1$, we consider finding an approximation to $\lambda_1$ itself. There are two popular ways of doing this in conjunction with the power method. In the first approach, one computes the approximation as

$$\mu_{\nu+1} = \frac{(Aq^{(\nu)})_k}{(q^{(\nu)})_k} = \frac{\sigma_{\nu+1}\, q_k^{(\nu+1)}}{q_k^{(\nu)}}, \tag{5.11}$$

where $q_k^{(\nu)}$ is the k-th component of $q^{(\nu)}$, usually the largest in modulus. Substituting (5.7) into (5.11) yields

$$\mu_{\nu+1} = \lambda_1 \, \frac{c_1 x_{1k} + \sum_{j=2}^{n} (\lambda_j / \lambda_1)^{\nu+1} c_j x_{jk}}{c_1 x_{1k} + \sum_{j=2}^{n} (\lambda_j / \lambda_1)^{\nu} c_j x_{jk}}, \tag{5.12}$$

where $x_{jk} = (x_j)_k$ = k-th component of $x_j$. If $c_1 \ne 0$ and k is such that $x_{1k} \ne 0$, then

$$\mu_{\nu+1} = \lambda_1 \Bigl[ 1 + O\Bigl( \Bigl| \frac{\lambda_2}{\lambda_1} \Bigr|^{\nu} \Bigr) \Bigr], \quad \nu = 1, 2, 3, \ldots \tag{5.13}$$

Thus, the sequence (5.11) converges to the dominant eigenvalue with the same rate of convergence as for the eigenvector. Note that if $\|\cdot\|_\infty$ is used in the normalization of $\sigma_\nu$, then $|\mu_{\nu+1}| = \sigma_{\nu+1}$ for $\nu$ sufficiently large. (Recall that $\sigma_{\nu+1} = \|Aq^{(\nu)}\|_\infty$ and $\|q^{(\nu)}\|_\infty = \|q^{(\nu+1)}\|_\infty = 1$. Also, $q_k^{(\nu+1)} = q_k^{(\nu)} = \pm 1$ in (5.11) if the maximum element is used for normalization.)

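A second way of finding an approximation to $\lambda_1$ is to use the Rayleigh quotient, i.e., compute for any $\nu$

$$\mu_{\nu+1} = \frac{(q^{(\nu)})^H A q^{(\nu)}}{(q^{(\nu)})^H q^{(\nu)}} = \frac{(q^{(\nu)})^H q^{(\nu+1)} \sigma_{\nu+1}}{(q^{(\nu)})^H q^{(\nu)}}. \tag{5.14}$$

Substituting (5.7) into (5.14) yields

$$\mu_{\nu+1} = \lambda_1 \, \frac{\sum_{j=1}^{n} \sum_{k=1}^{n} (\lambda_j / \lambda_1)^{\nu+1} (\bar\lambda_k / \bar\lambda_1)^{\nu} c_j \bar c_k \, (x_k^H x_j)}{\sum_{j=1}^{n} \sum_{k=1}^{n} (\lambda_j / \lambda_1)^{\nu} (\bar\lambda_k / \bar\lambda_1)^{\nu} c_j \bar c_k \, (x_k^H x_j)}. \tag{5.15}$$

It can then be shown that the estimate (5.13) holds. However, in the special case of A having orthogonal eigenvectors, i.e., $x_j^H x_k = 0$ for $j \ne k$, such as for a real symmetric or Hermitian matrix, the sequence defined by (5.15) satisfies

$$\mu_{\nu+1} = \lambda_1 + O\Bigl( \Bigl| \frac{\lambda_2}{\lambda_1} \Bigr|^{2\nu} \Bigr). \tag{5.16}$$

(To see this, note that, with the eigenvectors normalized so that $x_j^H x_j = 1$,

$$\mu_{\nu+1} = \lambda_1 \, \frac{|c_1|^2 + |c_2|^2 \, |\lambda_2 / \lambda_1|^{2\nu} (\lambda_2 / \lambda_1) + \cdots}{|c_1|^2 + |c_2|^2 \, |\lambda_2 / \lambda_1|^{2\nu} + \cdots},$$

so

$$|\mu_{\nu+1} - \lambda_1| \le c \, \Bigl| \frac{\lambda_2}{\lambda_1} \Bigr|^{2\nu}$$

for $\nu$ sufficiently large.)

REMARK 5.7 The more general class of normal matrices ($A^H A = AA^H$) has orthogonal eigenvectors.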
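The difference between the rates in (5.13) and (5.16) is visible on a small symmetric matrix. The sketch below assumes NumPy; the function name and the starting vector $q^{(0)} = e^{(1)}$ are choices made here for illustration. It uses the symmetric matrix with eigenvalues 6, 3, 1 that appears in Example 5.4 and records both estimates at every step.

```python
import numpy as np

# Symmetric test matrix with eigenvalues 6, 3, 1, so |lambda_2/lambda_1| = 1/2.
A = np.array([[4.0, -1.0, 1.0],
              [-1.0, 3.0, -2.0],
              [1.0, -2.0, 3.0]])

def eigenvalue_estimates(A, q0, num_iter):
    """Return component-ratio (5.11) and Rayleigh-quotient (5.14) estimates."""
    q = np.asarray(q0, dtype=float)
    ratio_est, rayleigh_est = [], []
    for _ in range(num_iter):
        z = A @ q
        k = np.argmax(np.abs(q))
        ratio_est.append(z[k] / q[k])            # Eq. (5.11)
        rayleigh_est.append((q @ z) / (q @ q))   # Eq. (5.14), real case
        q = z / np.max(np.abs(z))                # infinity-norm scaling
    return np.array(ratio_est), np.array(rayleigh_est)

ratio_est, rayleigh_est = eigenvalue_estimates(A, [1.0, 0.0, 0.0], 11)
ratio_err = np.abs(ratio_est - 6.0)
rayleigh_err = np.abs(rayleigh_est - 6.0)
```

The last entries use $q^{(10)}$: the component-ratio error behaves like $(1/2)^{10}$, while the Rayleigh-quotient error behaves like $(1/2)^{20}$.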


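REMARK 5.8 If the $q^{(\nu)}$'s are normalized with respect to the $\ell_2$-norm, formula (5.14) becomes

$$\mu_{\nu+1} = (q^{(\nu)})^H q^{(\nu+1)} \sigma_{\nu+1}, \tag{5.17}$$

since

$$(q^{(\nu)})^H q^{(\nu)} = 1 \quad \text{and} \quad \sigma_{\nu+1} = \|Aq^{(\nu)}\|_2.$$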


REMARK 5.9 Iterations are usually terminated by a criterion such as 
leg] <e, 


where € is a prescribed tolerance. 


REMARK 5.10 We made two assumptions above:

(a) A is nondefective (A has a complete set of eigenvectors);

(b) A has a unique dominant eigenvalue.

However, if the dominant eigenvalue is unique but not simple, the power method will still converge. Suppose that $\lambda_1$ has multiplicity r and has r linearly independent eigenvectors. Then,

$$q^{(0)} = \sum_{j=1}^{r} c_j x_j + \sum_{j=r+1}^{n} c_j x_j.$$

The sequence $q^{(\nu)}$ will converge to the direction

$$\sum_{j=1}^{r} c_j x_j.$$

However, if the dominant eigenvalue is not unique, e.g.

$$A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad \lambda = 1,\ -\frac12 + \frac{\sqrt3}{2} i,\ -\frac12 - \frac{\sqrt3}{2} i,$$

then the power method will fail to converge. This severely limits the applicability of the power method.


REMARK 5.11 Once a dominant eigenvalue and eigenvector have been 
found, a deflation technique may be applied to define a smaller matrix whose 
eigenvalues are the remaining eigenvalues of A. The power method is then 
applied to the smaller matrix. If all eigenvalues of A are simple with different 
magnitudes, this procedure can be used to find all the eigenvalues. (We will 
come back to this later.) 


Example 5.4
(Eigenvalue computation)

$$A = \begin{pmatrix} 4 & -1 & 1 \\ -1 & 3 & -2 \\ 1 & -2 & 3 \end{pmatrix}, \quad \lambda_1 = 6,\ \lambda_2 = 3,\ \lambda_3 = 1,\ x_1 = (1\ \ {-1}\ \ 1)^T.$$

The eigenvector approximations at selected iterations are

$(1\ \ {-1/4}\ \ 1/4)^T$
$(1\ \ {-0.8333}\ \ 0.8333)^T$
$(1\ \ {-0.9118}\ \ 0.9118)^T$
$(1\ \ {-0.9884}\ \ 0.9884)^T$
$(1\ \ {-0.9942}\ \ 0.9942)^T$

5.3 The Inverse Power Method


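The inverse power method has a faster rate of convergence than the power method, and can be used to compute any eigenvalue, not just the dominant one. A description of this method follows.

Let A have eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ corresponding to linearly independent eigenvectors $x_1, x_2, \ldots, x_n$. (Here, the eigenvalues are not necessarily ordered.) Then, the matrix $(\lambda I - A)^{-1}$ has eigenvalues $(\lambda - \lambda_1)^{-1}, (\lambda - \lambda_2)^{-1}, \ldots, (\lambda - \lambda_n)^{-1}$, corresponding to eigenvectors $x_1, x_2, \ldots, x_n$. Let $q^{(0)}$ be a starting vector, with

$$q^{(0)} = \sum_{j=1}^{n} c_j x_j \tag{5.18}$$

(assuming that the $x_j$, $j = 1, 2, \ldots, n$, are linearly independent). We let

$$q^{(1)} = (\lambda I - A)^{-1} q^{(0)} = \sum_{j=1}^{n} \frac{c_j x_j}{\lambda - \lambda_j}.$$

Continuing in this manner, we let

$$q^{(\nu)} = (\lambda I - A)^{-1} q^{(\nu-1)} = (\lambda I - A)^{-\nu} q^{(0)} = \sum_{j=1}^{n} \frac{c_j x_j}{(\lambda - \lambda_j)^{\nu}}. \tag{5.19}$$

Suppose that $\lambda$ is a close approximation to $\lambda_1$, i.e., $\lambda$ is much nearer to $\lambda_1$ than to $\lambda_2, \ldots, \lambda_n$. (It is not necessary that $\lambda_1$ be dominant.) Then, if $\lambda_1$ is simple, $|\lambda - \lambda_1|^{-1}$ is much greater than any of $|\lambda - \lambda_2|^{-1}, |\lambda - \lambda_3|^{-1}, \ldots, |\lambda - \lambda_n|^{-1}$. Thus, if $c_1 \ne 0$, the term $c_1 x_1 / (\lambda - \lambda_1)^\nu$ will dominate in (5.19). If

$$|\lambda - \lambda_2|^{-1} \ge |\lambda - \lambda_j|^{-1}, \quad 3 \le j \le n,$$

then

$$(\lambda - \lambda_1)^{\nu} q^{(\nu)} = c_1 x_1 + O\Bigl( \Bigl| \frac{\lambda_1 - \lambda}{\lambda_2 - \lambda} \Bigr|^{\nu} \Bigr). \tag{5.20}$$

Thus, the iterates $q^{(\nu)}$ converge in direction to $x_1$. In practice, $q^{(\nu)}$ would be normalized, as was done in the power method. The error estimate for the inverse power method analogous to (5.10) is

$$\Bigl\| q^{(\nu)} - \frac{x_1}{\|x_1\|} \Bigr\| = O\Bigl( \frac{|\lambda - \lambda_1|^{\nu}}{|\lambda - \lambda_2|^{\nu}} \Bigr),$$

where $q^{(\nu)}$ is normalized so that $\|q^{(\nu)}\| = 1$.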

REMARK 5.12 When $\lambda$ is close to $\lambda_1$, the convergence of the inverse power method can be rapid. If $\lambda_1 = 1$ and $\lambda_2 = 0.95$, then $|\lambda_2 / \lambda_1|^\nu < 0.1$ requires $\nu > 44$ for the power method. If $\lambda = 0.99$ in the inverse power method, then

$$\Bigl| \frac{0.99 - 1}{0.99 - 0.95} \Bigr|^{\nu} = \Bigl( \frac14 \Bigr)^{\nu} < 0.1 \quad \text{for } \nu \ge 2.$$

REMARK 5.13 If $\lambda = 0$, the inverse power method is simply the power method for $A^{-1}$, and $q^{(\nu)}$ converges to an eigenvector corresponding to the dominant eigenvalue of $A^{-1}$ (i.e., the inverse of the eigenvalue of least magnitude of A).

REMARK 5.14 If A is real and $\lambda_1$ is complex, then the choice $\lambda$ real results in $|\lambda - \lambda_1| = |\lambda - \lambda_2|$ where $\lambda_2 = \bar\lambda_1$. (Complex eigenvalues occur in conjugate pairs for A real.) The inverse power method will then not converge. For this reason, the inverse power method is often regarded as a method applicable to symmetric or Hermitian matrices.


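Consider now estimation of eigenvalues. Analogous to (5.11) and (5.14), we have

$$\mu_{\nu+1} = \frac{\sigma_{\nu+1}\, q_k^{(\nu+1)}}{q_k^{(\nu)}} \approx \frac{1}{\lambda - \lambda_1}$$

and

$$\mu_{\nu+1} = \frac{(q^{(\nu)})^H q^{(\nu+1)} \sigma_{\nu+1}}{(q^{(\nu)})^H q^{(\nu)}} \approx \frac{1}{\lambda - \lambda_1} \quad \text{(Rayleigh Quotient)}. \tag{5.23}$$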
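Example 5.5

$$A = \begin{pmatrix} 4 & -1 & 1 \\ 1 & 1 & 1 \\ -2 & 0 & -6 \end{pmatrix}, \quad \lambda_1 = -5.76849,\ \lambda_2 = 3.4691,\ \lambda_3 = 1.2994.$$

The power method requires 22 iterations to find $\lambda_1 \approx -5.7685 = \mu_{22}$ and $q^{(22)} = (-0.1157\ \ {-0.1306}\ \ 1)^T$. With $\lambda = -5.77$, the inverse power method requires 4 iterations to find $1/(\lambda - \lambda_1) \approx -672.438$, and thus $\lambda_1 \approx -5.7685$.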


REMARK 5.15 To calculate $q^{(\nu+1)}$ from $q^{(\nu)}$, generally $(\lambda I - A)^{-1}$ is not computed, but the following system is solved for $q^{(\nu+1)}$ instead:

$$(\lambda I - A)\, q^{(\nu+1)} = q^{(\nu)} \quad (q^{(\nu+1)} \text{ is then scaled}). \tag{5.24}$$

In general, Gaussian elimination with pivoting is used to solve (5.24). The method initially requires $O(n^3)$ multiplications, but the multipliers are saved to reduce further computations.
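The solve-then-scale step (5.24) can be sketched as follows. This is an illustration assuming NumPy, not a production routine: for brevity it calls `np.linalg.solve` at every step, which refactors the matrix each time, whereas the remark above suggests factoring once and reusing the multipliers. The function name is hypothetical, and the $3 \times 3$ test matrix is an assumption chosen to have an eigenvalue near $-5.7685$, so that the shift $\lambda = -5.77$ is a close approximation.

```python
import numpy as np

def inverse_power_method(A, shift, q0, num_iter=4):
    """Inverse power iteration: solve (shift*I - A) z = q, then scale z."""
    n = A.shape[0]
    M = shift * np.eye(n) - A
    q = np.asarray(q0, dtype=float)
    q = q / np.linalg.norm(q)
    eig_est = shift
    for _ in range(num_iter):
        z = np.linalg.solve(M, q)      # the linear system (5.24)
        mu = q @ z                     # Rayleigh quotient, since ||q||_2 = 1
        eig_est = shift - 1.0 / mu     # mu approximates 1/(shift - lambda_1)
        q = z / np.linalg.norm(z)      # scale to unit 2-norm
    return eig_est, q

# Assumed test matrix (hypothetical choice for this sketch).
A = np.array([[4.0, -1.0, 1.0],
              [1.0, 1.0, 1.0],
              [-2.0, 0.0, -6.0]])

eig_est, q = inverse_power_method(A, -5.77, [1.0, 1.0, 1.0])
residual = np.linalg.norm(A @ q - eig_est * q)
```

Because the shift is within about $1.5 \times 10^{-3}$ of the target eigenvalue and far from the others, four iterations already give a small residual.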


REMARK 5.16 The selection of $\lambda$ can be based on some other method, or perhaps Gerschgorin's Theorem can be used.


5.4 Deflation 


Suppose that $\lambda_1$ and $x_1$ are determined for an $n \times n$ matrix A using the power method or inverse power method. The goal of deflation is to find a simpler $(n-1) \times (n-1)$ matrix whose eigenvalues are those of A except for $\lambda_1$. The power method or inverse power method can then be applied to the simpler matrix to find another eigenvalue, say $\lambda_2$.

To begin, let x be an eigenvector of A corresponding to eigenvalue $\lambda$. Let U be an $n \times (n-1)$ matrix such that $(x, U)$ is unitary. Since $Ax = \lambda x$ and $A(x, U) = (Ax, AU)$, we have

$$(x, U)^H A (x, U) = \begin{pmatrix} x^H \\ U^H \end{pmatrix} (\lambda x, AU) = \begin{pmatrix} \lambda x^H x & x^H A U \\ \lambda U^H x & U^H A U \end{pmatrix}.$$

Now $x^H x = 1$ and x is orthogonal to the columns of U, i.e., $U^H x = 0$. Thus,

$$(x, U)^H A (x, U) = \begin{pmatrix} \lambda & h^H \\ 0 & C \end{pmatrix}, \tag{5.25}$$

where $C = U^H A U$ is $(n-1)$ by $(n-1)$ and $h^H = x^H A U$.

The matrix in (5.25) is block triangular and has for its eigenvalues $\lambda$ and the eigenvalues of C. But the matrix in (5.25) is obtained from A by a similarity transformation, so it has the same eigenvalues as A. Thus, C has the same eigenvalues as A, except for $\lambda$. We are thus left with the following question: How do we determine $(x, U)$? It turns out that $(x, U)$ can be determined by means of a Householder transformation, as we saw in §3.3.8 (starting on page 132).

Let's briefly review Householder transformations. Recall that if $\|w\|_2 = 1$, the $n \times n$ matrix $W = I - 2ww^H$ is called a Householder transformation. We saw that Householder transformations have several interesting properties:

(a) $W = W^H$ (W is Hermitian).

(b) $W^H W = W^2 = I$ (W is unitary).

(c) If $x \in \mathbb{R}^n$, $x_1$ is the first component of x, $x \ne 0$, $\sigma = \mathrm{sign}(x_1)\|x\|_2$ where $\mathrm{sign}(0) = +1$, and if

$$w = x + \sigma e^{(1)} \quad \text{and} \quad \theta = \tfrac12 \|w\|_2^2, \quad \text{then} \quad W = I - \frac{1}{\theta} w w^H$$

is a Householder transformation, and $Wx = -\sigma e^{(1)}$.

Property (c) indicates that Householder transformations have the capability of introducing zeros into vectors.

Now recall that x is an eigenvector corresponding to $\lambda$. We will construct a Householder transformation T using property (c) such that

$$Tx = -\mathrm{sign}(x_1)\|x\|_2 e^{(1)} = -\mathrm{sign}(x_1) e^{(1)},$$

assuming that $\|x\|_2 = 1$. Then,

$$T^2 x = x = -\mathrm{sign}(x_1)\, T e^{(1)}.$$

Thus, $Te^{(1)} = -\mathrm{sign}(x_1)x$. Therefore, the first column of T is the eigenvector $-\mathrm{sign}(x_1)x$. Hence, $T = (-\mathrm{sign}(x_1)x, U)$, where U is $n \times (n-1)$ with orthonormal columns. We have thus found U. In summary, the deflation procedure is:

(i) Begin with $\lambda$, x an eigenpair of A with $\|x\|_2 = 1$.

(ii) Compute T such that $Tx = -\mathrm{sign}(x_1)\|x\|_2 e^{(1)}$.
($\sigma \leftarrow \mathrm{sign}(x_1)\|x\|_2$, $w \leftarrow x + \sigma e^{(1)}$, $\theta \leftarrow (1/2)\|w\|_2^2$, $T \leftarrow I - ww^H/\theta$.)

(iii) Find U through the relation $T = (-\mathrm{sign}(x_1)x, U)$, where U is $n \times (n-1)$.

(iv) $C \leftarrow U^H A U$.

By (5.25),

$$T^H A T = \begin{pmatrix} \lambda_1 & h^H \\ 0 & C \end{pmatrix}$$

(with x replaced by $-\mathrm{sign}(x_1)x$ in the definition of h), and the $(n-1) \times (n-1)$ matrix C has the same eigenvalues as A except for $\lambda_1$. The matrix C can then be used in conjunction with the power method or inverse power method to find a second eigenvalue of A.
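Steps (i) through (iv) translate almost line by line into code. The following sketch assumes NumPy and a real eigenvector; the function name `householder_deflate` is hypothetical. It deflates the symmetric matrix with eigenvalues 6, 3, 1 and eigenvector $(1, -1, 1)^T$ used earlier in the chapter.

```python
import numpy as np

def householder_deflate(A, x):
    """Deflate A using a known real eigenvector x; returns (T, C)."""
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x)                 # step (i): ||x||_2 = 1
    n = x.size
    sign = 1.0 if x[0] >= 0 else -1.0         # convention sign(0) = +1
    sigma = sign * np.linalg.norm(x)
    w = x.copy()
    w[0] += sigma                             # w = x + sigma * e1
    theta = 0.5 * (w @ w)
    T = np.eye(n) - np.outer(w, w) / theta    # step (ii)
    U = T[:, 1:]                              # step (iii): drop first column
    C = U.T @ A @ U                           # step (iv)
    return T, C

A = np.array([[4.0, -1.0, 1.0],
              [-1.0, 3.0, -2.0],
              [1.0, -2.0, 3.0]])

T, C = householder_deflate(A, [1.0, -1.0, 1.0])
B = T.T @ A @ T          # block triangular form as in (5.25)
```

The $2 \times 2$ matrix `C` should carry the remaining eigenvalues 3 and 1, and the first column of `B` below the diagonal should vanish up to rounding.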


REMARK 5.17 Other deflation methods, such as Wielandt’s deflation 
method, are available, but it can be shown that the deflation method we have 
just described is not seriously affected by rounding errors.

5.5 The QR Method

The QR method is an iterative method for reducing a matrix to triangular form using orthogonal similarity transformations. When applied to a Hessenberg matrix,$^3$ the QR method is very competitive. The QR method is one of the best known methods for finding all the eigenvalues of a matrix. The eigenvectors may be found by transforming back the eigenvectors of the final triangular matrix.

Computing eigenvalues and eigenvectors by the QR method involves two 
steps: 


1. reducing the matrix to Hessenberg form, and 


2. iteration by QR decompositions. 


5.5.1 Reduction to Hessenberg or Tridiagonal Form 


The number of operations required to complete one QR transformation is proportional to $n^3$ for a full matrix but only to $n^2$ for a Hessenberg matrix. Thus, in the QR method, the matrix A is first transformed to (upper) Hessenberg form, then the QR method is applied to the upper Hessenberg matrix.

$^3$ We will define "Hessenberg matrix" shortly.

DEFINITION 5.3 A matrix A is (upper) Hessenberg if A has the form

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & \cdots & a_{2n} \\ 0 & a_{32} & a_{33} & \cdots & a_{3n} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & a_{n,n-1} & a_{nn} \end{pmatrix},$$

that is, if A has nonzeros only on and above the main diagonal and in the positions one index below the main diagonal.

To reduce A to Hessenberg form, we employ Householder transformations in the following procedure. Let $A_1 = A$ be a given $n \times n$ matrix. At the k-th step, the reduced matrix $A_k$ has the form

$$A_k = \begin{pmatrix} A_{11}^{(k)} & a_{12}^{(k)} & A_{13}^{(k)} \\ 0 & a_{22}^{(k)} & A_{23}^{(k)} \end{pmatrix}, \tag{5.26}$$

where $A_{11}^{(k)}$ is $k \times (k-1)$, $a_{12}^{(k)}$ is $k \times 1$, $a_{22}^{(k)}$ is $(n-k) \times 1$, $A_{13}^{(k)}$ is $k \times (n-k)$, and $A_{23}^{(k)}$ is $(n-k) \times (n-k)$. Let $H_k$ be an $(n-k) \times (n-k)$ Householder matrix such that

$$H_k a_{22}^{(k)} = \pm \|a_{22}^{(k)}\|_2 \, e^{(1)}, \quad \text{where } e^{(1)} \text{ is the first unit vector in } \mathbb{R}^{n-k}, \tag{5.27}$$

and let

$$U_k = \begin{pmatrix} I_k & 0 \\ 0 & H_k \end{pmatrix}. \quad (\text{Recall that } H_k^H = H_k \text{ and } H_k^2 = I.) \tag{5.28}$$

(See Lemma 3.3 on page 134 for details on construction of the transformation.) Then,

$$A_{k+1} = U_k^H A_k U_k = U_k A_k U_k = \begin{pmatrix} A_{11}^{(k)} & a_{12}^{(k)} & A_{13}^{(k)} H_k \\ 0 & \pm \|a_{22}^{(k)}\|_2 \, e^{(1)} & H_k A_{23}^{(k)} H_k \end{pmatrix}, \tag{5.29}$$
which carries the reduction to Hessenberg form one step further. The complete reduction requires on the order of $n^3$ multiplications (Exercise 12 on page 321). The final Hessenberg matrix is the matrix $A_{n-1}$ given by

$$A_{n-1} = U_{n-2}^H \cdots U_1^H A \, U_1 \cdots U_{n-2}. \tag{5.30}$$

Of course, if $U = U_1 U_2 \cdots U_{n-2}$, then U is orthogonal, and $A_{n-1}$ is similar to A, i.e.,

$$A_{n-1} = U^H A U = U^T A U = U^{-1} A U. \tag{5.31}$$


REMARK 5.18 If A is Hermitian, then (5.31) implies that $A_{n-1}$ is Hermitian. Since a Hermitian (upper) Hessenberg matrix is tridiagonal, we see that $A_{n-1}$ is tridiagonal. Furthermore, for A Hermitian, the algorithm to find $A_{n-1}$ can be implemented in such a way that substantially fewer multiplications (still proportional to $n^3$) are required (Exercise 13 on page 321).
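The reduction (5.26) through (5.30) can be sketched compactly for real matrices. The code below assumes NumPy; the function name `hessenberg_reduce` is hypothetical, and the sign choice for the Householder vector follows the convention $\sigma = \mathrm{sign}(x_1)\|x\|_2$ used in the deflation section. It forms each $H_k$ implicitly through the vector w rather than as a full matrix.

```python
import numpy as np

def hessenberg_reduce(A):
    """Reduce a real square matrix to upper Hessenberg form by Householder
    similarity transformations; returns the reduced matrix."""
    H = np.array(A, dtype=float)
    n = H.shape[0]
    for k in range(n - 2):
        x = H[k + 1:, k].copy()          # part of column k below the diagonal
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue                     # column already has the required zeros
        sigma = norm_x if x[0] >= 0 else -norm_x
        w = x
        w[0] += sigma                    # w = x + sigma * e1
        theta = 0.5 * (w @ w)
        # Apply H_k = I - w w^T / theta from the left, then from the right.
        H[k + 1:, :] -= np.outer(w, w @ H[k + 1:, :]) / theta
        H[:, k + 1:] -= np.outer(H[:, k + 1:] @ w, w) / theta
    return H

A = np.array([[4.0, 1.0, 2.0, 3.0],
              [1.0, 3.0, 0.0, 1.0],
              [2.0, 0.0, 2.0, 1.0],
              [3.0, 1.0, 1.0, 5.0]])

H = hessenberg_reduce(A)
```

Since the test matrix is symmetric, the result should be tridiagonal up to rounding, in line with Remark 5.18, and its eigenvalues should match those of A.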


5.5.2 The QR Method with Origin Shifts 


We first describe the algorithm. Let $A = A_0$ be a given $n \times n$ complex matrix, preferably in Hessenberg or tridiagonal form. Then, the QR algorithm produces a sequence of matrices $A_0, A_1, A_2, \ldots$ determined as follows:

(a) Given $A_\nu$, determine the scalar origin shift $\mu_\nu$ from the elements of $A_\nu$.

(b) Factor the matrix $A_\nu - \mu_\nu I$ into the form

$$A_\nu - \mu_\nu I = Q_\nu R_\nu, \tag{5.32}$$

where $Q_\nu$ is unitary and $R_\nu$ is upper triangular. (This is the orthogonal triangularization or QR decomposition of $A_\nu - \mu_\nu I$.)

(c) Compute $A_{\nu+1}$ by the formula

$$A_{\nu+1} = R_\nu Q_\nu + \mu_\nu I. \tag{5.33}$$

REMARK 5.19 As the iteration continues, the scalars $\mu_\nu$ converge to an eigenvalue of A. (We will explain how to choose $\mu_\nu$ later.)

REMARK 5.20 By Theorem 3.13 (page 136), the factorization (5.32) exists. In addition, if $A_\nu - \mu_\nu I$ is nonsingular, the factorization is unique. To see this, let $A_\nu - \mu_\nu I = \tilde Q_\nu \tilde R_\nu$ with $\tilde Q_\nu$ and $\tilde R_\nu$ having the required properties. Then,

$$(A_\nu - \mu_\nu I)^H (A_\nu - \mu_\nu I) = B_\nu = \tilde R_\nu^H \tilde Q_\nu^H \tilde Q_\nu \tilde R_\nu = \tilde R_\nu^H \tilde R_\nu,$$

and by (5.32), $B_\nu = R_\nu^H R_\nu$. Thus, $B_\nu$ has the Cholesky factorizations $B_\nu = \tilde R_\nu^H \tilde R_\nu = R_\nu^H R_\nu$. By uniqueness of the Cholesky factorization, $\tilde R_\nu = R_\nu$; $\tilde Q_\nu = Q_\nu$ then follows from nonsingularity of $R_\nu$.

REMARK 5.21 From (5.32) and (5.33), we see that $R_\nu = Q_\nu^H (A_\nu - \mu_\nu I)$ and $A_{\nu+1} = Q_\nu^H (A_\nu - \mu_\nu I) Q_\nu + \mu_\nu I = Q_\nu^H A_\nu Q_\nu$. Thus, $A_{\nu+1}$ is unitarily similar to $A_\nu$ and therefore has the same eigenvalues. Also, if $A_\nu$ is (upper) Hessenberg, then $A_{\nu+1}$ is (upper) Hessenberg (Exercise 14 on page 321).

REMARK 5.22 A single iteration of the QR method requires a multiple of $n^3$ multiplications. However, if A is (upper) Hessenberg, the method only requires a multiple of $n^2$ multiplications, and if A is tridiagonal, only a multiple of $n$ multiplications is required (Exercise 15 on page 321).


5.5.2.1 Convergence of the QR Method 


We have the following question: Why does the QR method converge? That is, why does the sequence $\{A_\nu\}$ tend to a triangular matrix that is unitarily similar to A?

We will see that the QR method is related to the power method and also the inverse power method. Let

$$\hat Q_\nu = Q_0 Q_1 \cdots Q_\nu \quad \text{and} \quad \hat R_\nu = R_\nu R_{\nu-1} \cdots R_0. \tag{5.34}$$

Since $A_{\nu+1} = Q_\nu^H A_\nu Q_\nu$, we have

$$A_{\nu+1} = \hat Q_\nu^H A \hat Q_\nu, \tag{5.35}$$

and, since $\hat Q_\nu$ is unitary,

$$(A_{\nu+1} - \mu_\nu I) = \hat Q_\nu^H (A - \mu_\nu I) \hat Q_\nu. \tag{5.36}$$

The following result connects the power method to the QR method.

PROPOSITION 5.5
In the above notation

$$\hat Q_\nu \hat R_\nu = (A - \mu_\nu I)(A - \mu_{\nu-1} I) \cdots (A - \mu_0 I). \tag{5.37}$$

PROOF For $\nu = 0$, (5.37) simply defines $Q_0$ and $R_0$. Assume now (5.37) holds for $\nu = k - 1$. From (5.33) and (5.36), we have

$$R_k = (A_{k+1} - \mu_k I) Q_k^H = \hat Q_k^H (A - \mu_k I) \hat Q_k Q_k^H = \hat Q_k^H (A - \mu_k I) \hat Q_{k-1}.$$

Multiplying both sides on the right by $\hat R_{k-1}$ results in

$$\hat R_k = \hat Q_k^H (A - \mu_k I) \hat Q_{k-1} \hat R_{k-1},$$

so

$$\hat Q_k \hat R_k = (A - \mu_k I) \hat Q_{k-1} \hat R_{k-1}.$$

Therefore, (5.37) holds by induction.

To see the significance of (5.37), consider the unshifted QR method, i.e., set $0 = \mu_0 = \mu_1 = \cdots = \mu_\nu$. Then, (5.37) becomes $A^{\nu+1} = \hat Q_\nu \hat R_\nu$, i.e., $\hat Q_\nu \hat R_\nu$ is the QR decomposition of $A^{\nu+1}$. Also, since $\hat R_\nu e^{(1)} = \hat r_{11}^{(\nu)} e^{(1)}$ (because $\hat R_\nu$ is upper triangular), the first column $\hat q_1^{(\nu)}$ of $\hat Q_\nu$ satisfies

$$\hat Q_\nu \hat R_\nu e^{(1)} = \hat r_{11}^{(\nu)} \hat q_1^{(\nu)} = A^{\nu+1} e^{(1)}.$$

Thus, $\hat q_1^{(\nu)}$ is the vector obtained by applying $(\nu + 1)$ steps of the power method to the initial vector $e^{(1)}$. If A has a dominant eigenvalue $\lambda_1$ and $e^{(1)}$ has a nonzero component in the direction of the eigenvector corresponding to $\lambda_1$, then $\hat q_1^{(\nu)}$ approaches that eigenvector.
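Relation (5.37) for the unshifted method is easy to confirm numerically. The sketch below assumes NumPy; the function name is hypothetical. It accumulates the products $Q_0 Q_1 \cdots Q_\nu$ and $R_\nu \cdots R_0$ over a few unshifted QR steps and compares their product with a power of A.

```python
import numpy as np

def unshifted_qr_products(A, num_steps):
    """Run unshifted QR steps; return the final iterate and the accumulated
    products Q_hat = Q_0 ... Q_nu and R_hat = R_nu ... R_0."""
    n = A.shape[0]
    A_nu = A.copy()
    Q_hat = np.eye(n)
    R_hat = np.eye(n)
    for _ in range(num_steps):
        Q, R = np.linalg.qr(A_nu)   # A_nu = Q R  (shift mu = 0)
        A_nu = R @ Q                # next iterate
        Q_hat = Q_hat @ Q           # append Q on the right
        R_hat = R @ R_hat           # prepend R on the left
    return A_nu, Q_hat, R_hat

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

A_nu, Q_hat, R_hat = unshifted_qr_products(A, 5)
power = np.linalg.matrix_power(A, 5)         # A^(nu+1) with nu = 4
v = power[:, 0] / np.linalg.norm(power[:, 0])
alignment = abs(Q_hat[:, 0] @ v)             # first column vs. A^5 e1
```

The identity holds for any sign convention in the individual QR factorizations, because the signs cancel in the product.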

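Now partition $A_\nu$ into the form

$$A_\nu = \begin{pmatrix} \alpha_\nu & h_\nu^H \\ g_\nu & C_\nu \end{pmatrix},$$

and do likewise for $A_{\nu+1}$. Analogous to the unitary matrix in deflation (see page 305), we partition $\hat Q_\nu$ into its first column $\hat q_1^{(\nu)}$ and remaining columns $\hat U_\nu$ to obtain

$$A_{\nu+1} = \hat Q_\nu^H A \hat Q_\nu = (\hat q_1^{(\nu)}, \hat U_\nu)^H A \, (\hat q_1^{(\nu)}, \hat U_\nu) = \begin{pmatrix} (\hat q_1^{(\nu)})^H A \hat q_1^{(\nu)} & (\hat q_1^{(\nu)})^H A \hat U_\nu \\ \hat U_\nu^H A \hat q_1^{(\nu)} & \hat U_\nu^H A \hat U_\nu \end{pmatrix} = \begin{pmatrix} \alpha_{\nu+1} & h_{\nu+1}^H \\ g_{\nu+1} & C_{\nu+1} \end{pmatrix}.$$

But $A \hat q_1^{(\nu)} \to \lambda_1 \hat q_1^{(\nu)}$ and $\hat U_\nu^H \hat q_1^{(\nu)} = 0$, since $\hat Q_\nu$ is unitary. Thus, $\hat U_\nu^H A \hat q_1^{(\nu)} = g_{\nu+1} \to 0$ as $\nu \to \infty$ and

$$\alpha_{\nu+1} = (\hat q_1^{(\nu)})^H A \hat q_1^{(\nu)} \to \lambda_1 (\hat q_1^{(\nu)})^H \hat q_1^{(\nu)} = \lambda_1 \quad \text{as } \nu \to \infty.$$

Now let's consider the shifted QR method in relation to the inverse power method. From (5.32), we have

$$Q_\nu = (A_\nu^H - \bar\mu_\nu I)^{-1} R_\nu^H,$$

since $Q_\nu^H = R_\nu (A_\nu - \mu_\nu I)^{-1}$. Since $R_\nu^H$ is lower triangular, $R_\nu^H e^{(n)} = \bar r_{nn}^{(\nu)} e^{(n)}$, where $r_{ij}^{(\nu)}$ is the i,j-th element of $R_\nu$. Now, let $q_i^{(\nu)}$ represent the i-th column of $Q_\nu$. Then,

$$q_n^{(\nu)} = Q_\nu e^{(n)} = (A_\nu^H - \bar\mu_\nu I)^{-1} R_\nu^H e^{(n)} = \bar r_{nn}^{(\nu)} (A_\nu^H - \bar\mu_\nu I)^{-1} e^{(n)}. \tag{5.38}$$

Thus, the last column of $Q_\nu$, that is, $q_n^{(\nu)}$, is the approximate eigenvector of $A_\nu^H$ that results from applying one step of the inverse power method with shift $\bar\mu_\nu$ to the vector $e^{(n)}$. This observation has the following consequences in terms of $A_{\nu+1}$. Partition $A_\nu$ into the form

$$A_\nu = \begin{pmatrix} C_\nu & h_\nu \\ g_\nu^H & a_{nn}^{(\nu)} \end{pmatrix}, \tag{5.39}$$

with $A_{\nu+1}$ partitioned accordingly, i.e.,

$$A_{\nu+1} = \begin{pmatrix} C_{\nu+1} & h_{\nu+1} \\ g_{\nu+1}^H & a_{nn}^{(\nu+1)} \end{pmatrix}.$$

Consider $A_{\nu+1} = Q_\nu^H A_\nu Q_\nu$ with

$$Q_\nu = (U_\nu, q_n^{(\nu)}).$$

Then,

$$A_{\nu+1} = (U_\nu, q_n^{(\nu)})^H A_\nu (U_\nu, q_n^{(\nu)}) = \begin{pmatrix} U_\nu^H A_\nu U_\nu & U_\nu^H A_\nu q_n^{(\nu)} \\ (q_n^{(\nu)})^H A_\nu U_\nu & (q_n^{(\nu)})^H A_\nu q_n^{(\nu)} \end{pmatrix}.$$

If $\bar\mu_\nu$ is near an eigenvalue $\bar\lambda$ of $A_\nu^H$, then $q_n^{(\nu)}$ will be nearer an accurate eigenvector than $e^{(n)}$; thus, $g_{\nu+1}^H = (q_n^{(\nu)})^H A_\nu U_\nu$ gives

$$g_{\nu+1} = U_\nu^H A_\nu^H q_n^{(\nu)} \approx \bar\lambda \, U_\nu^H q_n^{(\nu)} = 0,$$

since the columns of $U_\nu$ are orthogonal to $q_n^{(\nu)}$. Thus, $\|g_{\nu+1}\|$ will be smaller than $\|g_\nu\|$. In fact, judging from the convergence rate of the inverse power method, the vectors $g_\nu$ may approach zero rapidly if the $\mu_\nu$'s ever approximate an eigenvalue.

In (5.38), the natural choice for $\bar\mu_\nu$ is the Rayleigh quotient $(e^{(n)})^H A_\nu^H e^{(n)}$, i.e.,

$$\mu_\nu = a_{nn}^{(\nu)}. \tag{5.40}$$

Notice that, then,

$$a_{nn}^{(\nu+1)} = (q_n^{(\nu)})^H A_\nu q_n^{(\nu)} \approx \lambda.$$

Once $g_\nu$ is "sufficiently small," we may neglect it and restart the QR process with the smaller matrix $C_\nu$. Thus, the QR method "naturally deflates" A.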

REMARK 5.23 Suppose that A is real and (upper) Hessenberg and we wish to work in real arithmetic. Then, we must choose real shifts $\mu_\nu$; therefore, we cannot approximate a complex eigenvalue well. In the next section, we present a variant of the QR method, in which complex eigenvalues (which occur in conjugate pairs since A is real) can be determined by solving $2 \times 2$ eigenvalue problems.

REMARK 5.24 If A is symmetric, then its eigenvalues are real, and we do not have the problem of Remark 5.23. Therefore, the method of this section, coupled with preliminary reduction to tridiagonal form, is an effective and popular way to compute all the eigenvalues of a symmetric matrix.

REMARK 5.25 In the symmetric case, if we choose the shifts $\mu_\nu$ to be $a_{nn}^{(\nu)}$, then the process usually converges rapidly. In practice, the shift is chosen to be the eigenvalue of

$$\begin{pmatrix} a_{n-1,n-1}^{(\nu)} & a_{n-1,n}^{(\nu)} \\ a_{n,n-1}^{(\nu)} & a_{nn}^{(\nu)} \end{pmatrix}$$

closest to $a_{nn}^{(\nu)}$. Then it can be shown that the QR method always converges and generally converges rapidly.


REMARK 5.26 Because of the relationship of the QR method to the 
power and inverse power methods, the QR method is substantially slowed 
down if the eigenvalues of matrix A are packed closely together. That is, the 
QR method is most effective when the eigenvalues of A are separated. 


5.5.3 The Double QR Algorithm 


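Since this algorithm uses only real arithmetic, it is advantageous when A is real and not all the eigenvalues of A are real. Consider two steps of the QR method, i.e., (5.32) and (5.33); in other words,

$$\begin{aligned} A_\nu - \mu_\nu I &= Q_\nu R_\nu, \\ A_{\nu+1} &= R_\nu Q_\nu + \mu_\nu I, \\ A_{\nu+1} - \mu_{\nu+1} I &= Q_{\nu+1} R_{\nu+1}, \\ A_{\nu+2} &= R_{\nu+1} Q_{\nu+1} + \mu_{\nu+1} I. \end{aligned} \tag{5.41}$$

From (5.41), we have $A_{\nu+2} = Q_{\nu+1}^H A_{\nu+1} Q_{\nu+1} = Q_{\nu+1}^H Q_\nu^H A_\nu Q_\nu Q_{\nu+1}$. Thus,

$$A_{\nu+2} = (Q_\nu Q_{\nu+1})^H A_\nu (Q_\nu Q_{\nu+1}). \tag{5.42}$$

An interesting result is the fact that

$$\begin{aligned} Q_\nu Q_{\nu+1} R_{\nu+1} R_\nu &= Q_\nu (A_{\nu+1} - \mu_{\nu+1} I) R_\nu \\ &= Q_\nu A_{\nu+1} R_\nu - \mu_{\nu+1} Q_\nu R_\nu \\ &= Q_\nu (R_\nu Q_\nu + \mu_\nu I) R_\nu - \mu_{\nu+1} Q_\nu R_\nu \\ &= Q_\nu R_\nu Q_\nu R_\nu + \mu_\nu Q_\nu R_\nu - \mu_{\nu+1} Q_\nu R_\nu \\ &= (Q_\nu R_\nu + \mu_\nu I) Q_\nu R_\nu - \mu_{\nu+1} Q_\nu R_\nu \\ &= A_\nu Q_\nu R_\nu - \mu_{\nu+1} Q_\nu R_\nu \\ &= (A_\nu - \mu_{\nu+1} I) Q_\nu R_\nu \\ &= (A_\nu - \mu_{\nu+1} I)(A_\nu - \mu_\nu I). \end{aligned}$$

Thus,

$$Q_\nu Q_{\nu+1} R_{\nu+1} R_\nu = (A_\nu - \mu_{\nu+1} I)(A_\nu - \mu_\nu I). \tag{5.43}$$

Thus, if $A_\nu$ is real and $\mu_{\nu+1} = \bar\mu_\nu$, the matrix on the right in (5.43) is real, and this implies that $Q_\nu Q_{\nu+1}$ and $R_{\nu+1} R_\nu$ are real. (If A is real, the Householder transformations used to construct the QR decomposition are real.) Hence, each element of the sequence $A_1, A_3, A_5, \ldots$ is real, since $A_1$ is real by assumption and $A_3 = (Q_1 Q_2)^H A_1 (Q_1 Q_2)$, $A_5 = (Q_3 Q_4)^H A_3 (Q_3 Q_4)$, $\ldots$ by (5.42).

We can avoid computing $A_2, A_4, \ldots$ by the following procedure. We need $Q_\nu Q_{\nu+1}$ to compute $A_{\nu+2}$ by (5.42), i.e., $A_{\nu+2} = (Q_\nu Q_{\nu+1})^H A_\nu (Q_\nu Q_{\nu+1})$. However, we can find $Q_\nu Q_{\nu+1}$ without computing $A_{\nu+1}$. We decompose $(A_\nu - \mu_\nu I)(A_\nu - \bar\mu_\nu I) = \hat Q_\nu \hat R_\nu$. Then, $\hat Q_\nu = Q_\nu Q_{\nu+1}$ by (5.43), $A_{\nu+2} = \hat Q_\nu^H A_\nu \hat Q_\nu$ by (5.42), and we have found $A_{\nu+2}$ with real arithmetic and without computing $A_{\nu+1}$. Thus, $A_1, A_3, A_5, \ldots$ can be computed using this procedure. The price we pay for this convenience is that, at step $\nu$ of the double QR algorithm, $(A_\nu - \mu_\nu I)(A_\nu - \bar\mu_\nu I)$ must be computed.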


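REMARK 5.27 The shifts $\mu_\nu$ and $\mu_{\nu+1}$ are usually chosen to be the eigenvalues of

$$\begin{pmatrix} a_{n-1,n-1}^{(\nu)} & a_{n-1,n}^{(\nu)} \\ a_{n,n-1}^{(\nu)} & a_{nn}^{(\nu)} \end{pmatrix}. \tag{5.44}$$

It is easy to verify that $\mu_\nu$ and $\mu_{\nu+1}$ satisfy

$$\mu_\nu + \mu_{\nu+1} = a_{n-1,n-1}^{(\nu)} + a_{nn}^{(\nu)}$$

and

$$\mu_\nu \mu_{\nu+1} = a_{nn}^{(\nu)} a_{n-1,n-1}^{(\nu)} - a_{n-1,n}^{(\nu)} a_{n,n-1}^{(\nu)},$$

and also

$$(A_\nu - \mu_\nu I)(A_\nu - \mu_{\nu+1} I) = A_\nu^2 - \bigl( a_{n-1,n-1}^{(\nu)} + a_{nn}^{(\nu)} \bigr) A_\nu + \bigl( a_{nn}^{(\nu)} a_{n-1,n-1}^{(\nu)} - a_{n-1,n}^{(\nu)} a_{n,n-1}^{(\nu)} \bigr) I.$$

Hence, if the eigenvalues of (5.44) are complex conjugates, i.e., $\bar\mu_\nu = \mu_{\nu+1}$, then the double QR method accomplishes in real arithmetic the effect of performing two single shifts, with one shift $\mu_\nu$ and one shift $\bar\mu_\nu$.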
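One explicit double step, using the trace and determinant of the trailing $2 \times 2$ block as in the formulas above, can be sketched as follows. This assumes NumPy and forms $A_\nu^2$ explicitly, which is the cost noted in the text; it is an illustration, not the implicit variant used in production software. The function name and the two test matrices are choices made here.

```python
import numpy as np

def double_qr_step(A):
    """One explicit double-shift QR step in real arithmetic."""
    n = A.shape[0]
    s = A[-2, -2] + A[-1, -1]                          # sum of the two shifts
    p = A[-2, -2] * A[-1, -1] - A[-2, -1] * A[-1, -2]  # product of the shifts
    M = A @ A - s * A + p * np.eye(n)                  # real matrix (A - mu I)(A - mu_bar I)
    Q, _ = np.linalg.qr(M)                             # Q plays the role of Q_nu Q_{nu+1}
    return Q.T @ A @ Q                                 # A_{nu+2}, cf. (5.42)

# Real Hessenberg matrix whose trailing 2x2 block has complex eigenvalues.
A = np.array([[1.0, -2.0, 0.5],
              [3.0, 1.0, 2.0],
              [0.0, -4.0, 1.0]])
B = double_qr_step(A)

# Cyclic permutation matrix: both shifts are zero, and Q is a signed copy of P^2.
P = np.roll(np.eye(5), 1, axis=0)
P_next = double_qr_step(P)
```

The step is a real orthogonal similarity, so `B` has the same characteristic polynomial as `A`; for the permutation matrix the result differs from `P` at most in the signs of its entries, so no subdiagonal entry shrinks.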


REMARK 5.28 Of course, since the elements of $A_\nu$ are real for $\nu$ odd, the element $a_{nn}^{(\nu)}$ cannot converge to a complex eigenvalue. What happens is that the $2 \times 2$ submatrix in the trailing position eventually converges. The eigenvalues of the $2 \times 2$ matrix approximate a pair of complex conjugate eigenvalues of the original matrix. Thus, the double QR method will eventually produce a matrix orthogonally similar to the original matrix, and the eigenvalues can be found by solving $2 \times 2$ eigenvalue problems for complex eigenvalues or $1 \times 1$ problems for real eigenvalues.


REMARK 5.29 The double QR method is not guaranteed to converge. For instance, the matrix

$$\begin{pmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

is left unchanged by a double QR step, and therefore none of the subdiagonal elements converge to zero. In practice, when the method does not converge, a couple of applications of the single shift QR method with randomly chosen shifts generally produces a matrix for which the double QR method will converge.


Example 5.6 
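$$A = \frac{1}{(25)^2}\begin{pmatrix} 1490 & 695 & -600 \\ 695 & 1635 & -175 \\ -600 & -175 & 2500 \end{pmatrix}, \qquad \begin{aligned} \lambda_1 &= 3 + \sqrt{3} \approx 4.7321, \\ \lambda_2 &= 3.0, \\ \lambda_3 &= 3 - \sqrt{3} \approx 1.2679. \end{aligned}$$

The Householder transformation

$$U = \frac{1}{25}\begin{pmatrix} 7 & -24 & 0 \\ -24 & -7 & 0 \\ 0 & 0 & 25 \end{pmatrix}$$

transforms $A$ to Hessenberg form

$$U^H A U = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 4 \end{pmatrix}.$$

Choosing the shifts $\mu_\nu = a_{nn}^{(\nu)}$ (see Eq. (5.40), page 312) gives the following
iterates in the QR method:

$$A_1 = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 4 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 1.4000 & 0.4899 & 0 \\ 0.4899 & 3.2667 & 0.7454 \\ 0 & 0.7454 & 4.3333 \end{pmatrix},$$

$$A_3 = \begin{pmatrix} 1.2915 & 0.2017 & 0 \\ 0.2017 & 3.0202 & 0.2724 \\ 0 & 0.2724 & 4.6884 \end{pmatrix}, \quad A_4 = \begin{pmatrix} 1.2739 & 0.0993 & 0 \\ 0.0993 & 2.9943 & 0.0072 \\ 0 & 0.0072 & 4.7320 \end{pmatrix},$$

$$A_5 = \begin{pmatrix} 1.2694 & 0.0498 & 0 \\ 0.0498 & 2.9986 & 0 \\ 0 & 0 & 4.7321 \end{pmatrix}.$$

At this point, the QR method can be applied to the smaller $2 \times 2$ matrix

$$\begin{pmatrix} 1.2694 & 0.0498 \\ 0.0498 & 2.9986 \end{pmatrix}.$$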


5.6 Jacobi Diagonalization (Jacobi Method) 


The Jacobi method for computing eigenvalues is one of the oldest numerical 
methods for the eigenvalue problem. It was replaced by the QR algorithm as 
the method of choice in the 1960’s. However, it is making a comeback due 
to its adaptability to parallel computers [40, 97]. We give a brief description 
of the Jacobi method in this section. Let $A^{(1)} = A$ be an $n \times n$ symmetric
matrix. The procedure consists of

$$A^{(k+1)} = N_k^H A^{(k)} N_k, \tag{5.45}$$

where the $N_k$ are unitary matrices that eliminate the off-diagonal element of
largest modulus.


REMARK 5.30 It can be shown that if $a_{pq}^{(k)}$ is the off-diagonal element
of largest modulus, one transformation increases the sum of the squares of
the diagonal elements by $2\left(a_{pq}^{(k)}\right)^2$ and at the same time decreases the sum of the
squares of the off-diagonal elements by the same amount. Thus, $A^{(k)}$ tends
to a diagonal matrix as $k \to \infty$.


REMARK 5.31 Since $A^{(1)}$ is symmetric, $A^{(2)} = N_1^H A^{(1)} N_1$ is symmetric,
so $A^{(3)}, A^{(4)}, \dots$ are symmetric. Also, since $A^{(k+1)}$ is similar to $A^{(k)}$, $A^{(k+1)}$
has the same eigenvalues as $A^{(k)}$, and hence has the same eigenvalues as $A$.


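We now consider how to find $N_k$ such that the largest off-diagonal element
of $A^{(k)}$ is eliminated. Let

$$A^{(k)} = \begin{pmatrix} a_{11}^{(k)} & \cdots & a_{1n}^{(k)} \\ \vdots & & \vdots \\ a_{n1}^{(k)} & \cdots & a_{nn}^{(k)} \end{pmatrix},$$

and suppose that $|a_{pq}^{(k)}| \ge |a_{ij}^{(k)}|$ for $1 \le i, j \le n$, $i \ne j$. Let

$$N_k = \begin{pmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & \cos\alpha_k & \cdots & -\sin\alpha_k & & \\ & & \vdots & \ddots & \vdots & & \\ & & \sin\alpha_k & \cdots & \cos\alpha_k & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{pmatrix} \begin{matrix} \\ \\ \leftarrow \text{row } p \\ \\ \leftarrow \text{row } q \\ \\ \\ \end{matrix}$$

(the trigonometric entries lie in rows and columns $p$ and $q$; all other entries agree with the identity matrix).
$N_k$ is a Givens transformation (also called a plane rotator or Jacobi rotator),
as we explained starting on page 137. Note that $N_k^H N_k = I$ ($N_k$ is unitary).
When $A^{(k+1)}$ is constructed, only rows $p$ and $q$ and columns $p$ and $q$ of $A^{(k+1)}$
are different from those of $A^{(k)}$. The choice for $\alpha_k$ is such that $a_{pq}^{(k+1)} = 0$.
That is, since

$$a_{pq}^{(k+1)} = \left(-a_{pp}^{(k)} + a_{qq}^{(k)}\right)\cos\alpha_k \sin\alpha_k + a_{pq}^{(k)}\left(\cos^2\alpha_k - \sin^2\alpha_k\right) = 0,$$

$\cos\alpha_k$ and $\sin\alpha_k$ are chosen so that

$$\cos^2\alpha_k = \frac{1}{2} + \frac{a_{pp}^{(k)} - a_{qq}^{(k)}}{2r}, \qquad \sin^2\alpha_k = \frac{1}{2} - \frac{a_{pp}^{(k)} - a_{qq}^{(k)}}{2r}, \qquad\text{and}\qquad \sin\alpha_k \cos\alpha_k = \frac{a_{pq}^{(k)}}{r},$$

where $r^2 = \left(a_{pp}^{(k)} - a_{qq}^{(k)}\right)^2 + 4\left(a_{pq}^{(k)}\right)^2$.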
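In summary, the Jacobi computational algorithm consists of the following
steps, where the third step provides stability with respect to rounding errors:
(1) At step $k$, find $a_{pq}^{(k)}$ such that $p \ne q$ and $|a_{pq}^{(k)}| \ge |a_{ij}^{(k)}|$ for $1 \le i, j \le n$,
$i \ne j$.
(2) Set $r^2 = \left(a_{pp}^{(k)} - a_{qq}^{(k)}\right)^2 + 4\left(a_{pq}^{(k)}\right)^2$ (with $r > 0$) and $t = 0.5 + \left(a_{pp}^{(k)} - a_{qq}^{(k)}\right)/(2r)$.
(3) Set chk $= a_{pp}^{(k)} - a_{qq}^{(k)}$.
IF chk > 0 THEN
set $c = \sqrt{t}$ and $s = a_{pq}^{(k)}/(rc)$,
ELSE
set $s = \sqrt{1 - t}$ and $c = a_{pq}^{(k)}/(rs)$.
END IF
(4) Set

$$(N_k)_{ij} = \begin{cases} 1 & \text{if } i = j,\ i \ne p,\ i \ne q, \\ c & \text{if } i = p,\ j = p, \\ -s & \text{if } i = p,\ j = q, \\ s & \text{if } i = q,\ j = p, \\ c & \text{if } i = q,\ j = q, \\ 0 & \text{otherwise.} \end{cases}$$

(5) Set $A^{(k+1)} = N_k^T A^{(k)} N_k$.
(6) Go to Step (1) until $|a_{pq}^{(k)}| < \epsilon$.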
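The six steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the text: the function name `jacobi_eigenvalues`, the tolerance `tol`, and the rotation cap `max_rotations` are assumptions introduced here, the matrix is stored as a list of lists, and the sign convention for $s$ is the one that makes $a_{pq}^{(k+1)} = 0$ in the equation preceding the algorithm.

```python
import math

def jacobi_eigenvalues(a, tol=1e-12, max_rotations=1000):
    """Classical Jacobi iteration for a real symmetric matrix (n >= 2).

    `a` is a list of lists; the sorted diagonal of the final iterate is returned.
    """
    n = len(a)
    a = [row[:] for row in a]                     # work on a copy
    for _ in range(max_rotations):
        # Step (1): off-diagonal element of largest modulus.
        p, q = max(((i, j) for i in range(n) for j in range(n) if i != j),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        apq = a[p][q]
        if abs(apq) < tol:                        # Step (6): stopping test
            break
        # Step (2): r and t = cos^2(alpha_k).
        chk = a[p][p] - a[q][q]
        r = math.hypot(chk, 2.0 * apq)
        t = 0.5 + chk / (2.0 * r)
        # Step (3): take the square root of the quantity that is >= 1/2.
        if chk > 0:
            c = math.sqrt(t)
            s = apq / (r * c)
        else:
            s = math.sqrt(1.0 - t)
            c = apq / (r * s)
        # Steps (4)-(5): A <- N^T A N with N[p][p] = N[q][q] = c,
        # N[q][p] = s, N[p][q] = -s, identity elsewhere.
        for i in range(n):                        # columns p, q of A N
            aip, aiq = a[i][p], a[i][q]
            a[i][p] = c * aip + s * aiq
            a[i][q] = -s * aip + c * aiq
        for j in range(n):                        # rows p, q of N^T (A N)
            apj, aqj = a[p][j], a[q][j]
            a[p][j] = c * apj + s * aqj
            a[q][j] = -s * apj + c * aqj
    return sorted(a[i][i] for i in range(n))

# Illustrative matrix (chosen here, not taken from the text).
example = jacobi_eigenvalues([[2.0, -1.0, 0.0],
                              [-1.0, 2.0, -1.0],
                              [0.0, -1.0, 2.0]])
print(example)
```

Only rows and columns $p$ and $q$ are touched in each rotation, which is why the update is written as two short loops rather than as full matrix products.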


Example 5.7 


Sb SF © 
oe 
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Then, 


$A^{(2)} = N_1^H A^{(1)} N_1 =$


gh ak © 
ow Slr 
“I (on) Sl- 


Notice that the sum of the squares of the diagonal elements of $A^{(2)}$ is 8 more
than the sum of the squares of the diagonal elements of $A^{(1)}$, and the sum of
the squares of the off-diagonal elements of $A^{(2)}$ is 8 less than the sum of the
squares of the off-diagonal elements of $A^{(1)}$.


5.7 Simultaneous Iteration (Subspace Iteration) 


Simultaneous iteration methods are extensions of the power method in 
which several vectors are processed simultaneously in such a way that the 
subspace spanned by these vectors converges to the subspace of the dominant 
set of eigenvectors of the matrix. 

In this section, it is assumed that the $n \times n$ matrix $A$ has linearly independent
eigenvectors $v_1, v_2, \dots, v_n$ and associated eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$
satisfying

$$|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_k| > |\lambda_{k+1}| \ge \cdots \ge |\lambda_n| > 0.$$

Let $\mathcal{Y} = \mathrm{span}(v_1, v_2, \dots, v_k)$; $\mathcal{Y}$ is a $k$-dimensional subspace of $\mathbb{C}^n$.

In simultaneous iteration methods, a matrix $U^{(0)} = (u_1^{(0)}, u_2^{(0)}, \dots, u_k^{(0)})$ is
randomly selected. At the $m$-th step, $V^{(m)} = A U^{(m)}$. To prevent all the
trial vectors from converging to the dominant eigenvector, a new set of trial
vectors $U^{(m+1)}$ is obtained from $V^{(m)}$ by an orthonormalization procedure.

The following procedure may be used:
(a) Let $u_1^{(0)}, u_2^{(0)}, \dots, u_k^{(0)}$ be a set of orthonormal vectors for the subspace

$$\mathcal{Y}_0 = \mathrm{span}\left(u_1^{(0)}, u_2^{(0)}, \dots, u_k^{(0)}\right).$$

(b) For $m = 0, 1, 2, \dots$

(i) Calculate $Au_1^{(m)}, Au_2^{(m)}, \dots, Au_k^{(m)}$, which is a basis for
$\mathcal{Y}_{m+1} = \mathrm{span}\left(Au_1^{(m)}, Au_2^{(m)}, \dots, Au_k^{(m)}\right)$.

(ii) Orthonormalize the above vectors by the Gram–Schmidt process
to obtain $u_1^{(m+1)}, u_2^{(m+1)}, \dots, u_k^{(m+1)}$ and

$$\mathcal{Y}_{m+1} = \mathrm{span}\left(u_1^{(m+1)}, u_2^{(m+1)}, \dots, u_k^{(m+1)}\right).$$

(Notice that the Gram–Schmidt procedure preserves the subspace.)

It is straightforward to see that $u_\ell^{(m)} \to u_\ell$ as $m \to \infty$, where

$$\mathrm{span}(u_1, u_2, \dots, u_k) = \mathcal{Y} = \mathrm{span}(v_1, v_2, \dots, v_k).$$

(See [97].) To see this, let

$$\mathcal{Y}_0 = \mathrm{span}\left(\sum_{i=1}^n c_{i1} v_i,\ \sum_{i=1}^n c_{i2} v_i,\ \dots,\ \sum_{i=1}^n c_{ik} v_i\right).$$

Then

$$\mathcal{Y}_m = \mathrm{span}\left(\sum_{i=1}^n c_{i1} \lambda_i^m v_i,\ \sum_{i=1}^n c_{i2} \lambda_i^m v_i,\ \dots,\ \sum_{i=1}^n c_{ik} \lambda_i^m v_i\right) \to \mathcal{Y} = \mathrm{span}(v_1, v_2, \dots, v_k) \quad\text{as } m \to \infty,$$

since $|\lambda_k| > |\lambda_{k+1}|$. (Notice also, if $A$ is symmetric, then the first $k$ dominant
eigenvectors of $A$ are $u_\ell$, $\ell = 1, 2, \dots, k$.)
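Steps (a) and (b) above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the helper names (`matvec`, `gram_schmidt`, `simultaneous_iteration`) are introduced here, the test matrix is symmetric (so the limit vectors are eigenvectors and Rayleigh quotients estimate the eigenvalues), and fixed starting vectors replace the random choice in the text so that the run is reproducible.

```python
import math

def matvec(a, x):
    """Product A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def gram_schmidt(vectors):
    """Orthonormalize the vectors in order (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = v[:]
        for u in basis:
            proj = sum(ui * wi for ui, wi in zip(u, w))
            w = [wi - proj * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return basis

def simultaneous_iteration(a, start_vectors, steps=200):
    u = gram_schmidt(start_vectors)            # step (a)
    for _ in range(steps):                     # step (b)
        v = [matvec(a, ui) for ui in u]        # (i)  multiply each trial vector by A
        u = gram_schmidt(v)                    # (ii) re-orthonormalize
    # Rayleigh quotients u^T A u (the u are unit vectors).
    quotients = [sum(ui * wi for ui, wi in zip(vec, matvec(a, vec))) for vec in u]
    return u, quotients

# Illustrative symmetric matrix with eigenvalues 2 + sqrt(2), 2, 2 - sqrt(2).
A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
vectors, estimates = simultaneous_iteration(A, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(estimates)
```

Because Gram–Schmidt processes the vectors in a fixed order, the first vector behaves exactly like the power method, while the later ones are kept orthogonal to it and so pick out the next dominant directions.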


5.8 Exercises 


1. If
$$A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix},$$
then 


(a) Use the Gerschgorin theorem to bound the eigenvalues of A. 


(b) Compute the eigenvalues and eigenvectors of A directly from the 
definition, and compare with the results you obtained by using 
Gerschgorin’s circle theorem. 


2. Consider

A=(j a


(a) Compute the eigenvalues of A directly from the definition.


(b) Apply the Gerschgorin circle theorem to the matrix A.

(c) Why is A not a counterexample to the Gerschgorin theorem for
Hermitian matrices (Corollary 5.1)?


3. Let


RF wk © Ale 
CO Ole ole 


Show that the spectral radius p(A) < 


4. Let A be a diagonally dominant matrix. Can zero be in the spectrum
of A?


5. Prove Theorem 3.21 (on page 160), using notation and procedures from
this chapter. 

Hint: You may wish to try proving the theorem first in the case that 
there are n distinct eigenvalues. You may also wish to consider the 
Schur decomposition (on page 101), although this is not the only way 
this theorem can be proven. 


6. Apply several iterations of the power method to the matrix from Prob-
lem 1 on page 319, using several different starting vectors. Compare the 
results to the results you obtained from Problem 1 on page 319. 


7. Let $A$ be a real symmetric $n \times n$ matrix with dominant eigenvalues
$\lambda_1 = 1$ and $\lambda_2 = -1$. Show that the power method fails to converge.
Determine the behavior of successive iterates of the power method.


8. Suppose that $\lambda_1$ and $x_1$ have been obtained for a real symmetric $n \times n$
matrix $A$ using the power method, and

$$|\lambda_1| > |\lambda_2| > |\lambda_3| > |\lambda_4| > \cdots > |\lambda_n|.$$

Let
$$q^{(0)} = z^{(0)} - \left((z^{(0)})^T x_1\right) x_1,$$
$$z^{(k+1)} = A q^{(k)}, \quad\text{and}$$
$$q^{(k+1)} = z^{(k+1)} - \left((z^{(k+1)})^T x_1\right) x_1,$$

where $z^{(0)}$ is randomly chosen. Show that $q^{(k+1)}/\lambda_2^{k+1} \to c_2 x_2$, where
$x_2$ is the eigenvector corresponding to $\lambda_2$ and $c_2$ is a constant. (Note
that the eigenvectors of a symmetric matrix $A$ are orthogonal, that is,
$x_i^T x_j = 0$ if $i \ne j$. Also, assume $x_1^T x_1 = 1$.)


9. Apply several iterations of the inverse power method to the matrix from
Problem 1 on page 319, using several different starting vectors, and
using the centers of the Gerschgorin circles as estimates for $\lambda$. Compare
the results to the results you obtained from Problem 6 on page 320.


10. Let $A$ be a $(2n+1) \times (2n+1)$ symmetric matrix with elements $a_{ij} = (1.5)^i$
if $i = j$ and $a_{ij} = (0.5)^{i+j-1}$ if $i \ne j$. Let the eigenvalues of $A$ be $\lambda_i$,
$i = 1, 2, \dots, 2n+1$, ordered such that $\lambda_1 < \lambda_2 < \cdots < \lambda_{2n} < \lambda_{2n+1}$.
We wish to compute the eigenvector $x_{n+1}$ associated with the middle
eigenvalue $\lambda_{n+1}$ using the inverse power method $q^{r} = (\lambda I - A)^{-1} q^{r-1}$ for
$r = 1, 2, \dots$ Considering Gerschgorin's Theorem for symmetric matrices,
choose a value for $\lambda$ that would ensure rapid convergence. Explain how
you chose this value.


11. Compute all of the eigenvalues of the matrix from Problem 1 on page 319,
using the power method with deflation. Compare the results to the re- 
sults you obtained from Problem 1 on page 319. 


12. Show that reduction of an $n$ by $n$ matrix to upper Hessenberg form
requires $(5/3)n^3 + O(n^2)$ multiplications.


13. Show that reduction of an $n$ by $n$ Hermitian matrix to upper Hessenberg
form can be done with $(2/3)n^3 + O(n^2)$ multiplications.


14. Show that, in QR iteration (formulas (5.32) and (5.33) on page 308),
$A_{\nu+1}$ is upper Hessenberg whenever $A_\nu$ is.


15. Prove the assertions in Remark 5.22 on page 309.


16. Let the $3 \times 3$ matrix $A$ have eigenvalues $\lambda_1 = 2$, $\lambda_2 = 4$, and $\lambda_3 = 6$, with
associated eigenvectors $v_1$, $v_2$, and $v_3$. Consider the iteration method


Cep1 = —(At+ 31) -1(A = Bl) ae, 


where xp = 01 + V2 + v3. Prove that 
Cc 
|zx — z|| < 3R 
for some constant c > 0, where z is one of the eigenvectors v1, v2, or U3. 


17. Let $A$ be an $n \times n$ matrix and let $b \in \mathbb{R}^n$. Let $K$ be the $n \times n$ matrix
$K = (k_1, k_2, \dots, k_n)$ with $j$-th column $k_j = A^{j-1} b$ for $j = 1, 2, \dots, n$.
Assume that $K$ is nonsingular. Let $c = -K^{-1} A^n b$.

(a) Show that
$$AK = K(e_2, e_3, \dots, e_n, -c),$$
where $e_i$ is the $i$-th column of the identity matrix, so
$$K^{-1} A K = (e_2, e_3, \dots, e_n, -c) = C.$$

(b) Prove that $A$ and $C$ have the same eigenvalues.

(c) Show that
$$\det(C - \lambda I) = (-1)^n \left(\lambda^n + \sum_{i=1}^n c_i \lambda^{i-1}\right) = p(\lambda).$$

(d) Based on (a), (b), and (c), explain why calculation of the eigenvalues
of an $n \times n$ matrix with $n \ge 5$ generally cannot be performed
in a finite number of steps.


Chapter 6 


Numerical Differentiation and 
Integration 


In this chapter, we study the fundamental problem of approximating integrals 
and derivatives. 


6.1 Numerical Differentiation 


There are two common ways to develop approximations to derivatives, using 
Taylor’s formula or Lagrange interpolation. 


6.1.1 Derivation with Taylor’s Theorem 


Consider applying Taylor's formula for approximating derivatives. Suppose
that $f \in C^2[a, b]$. We wish to approximate $f'(x_0)$ for $x_0 \in (a, b)$. By Taylor's
formula,

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(\xi)$$

for some $\xi$ between $x$ and $x_0$. Thus, letting $x = x_0 + h$,

$$f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f''(\xi).$$

Hence,

$$f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} + O(h) \quad\text{(forward-difference formula).} \tag{6.1}$$

To obtain a better approximation, suppose that $f \in C^3[a, b]$ and consider

$$\begin{aligned} f(x_0 + h) &= f(x_0) + f'(x_0) h + f''(x_0)\frac{h^2}{2} + f'''(\xi_1)\frac{h^3}{6}, \\ f(x_0 - h) &= f(x_0) - f'(x_0) h + f''(x_0)\frac{h^2}{2} - f'''(\xi_2)\frac{h^3}{6}. \end{aligned} \tag{6.2}$$

Subtracting the above two expressions and dividing by $2h$ gives

$$f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} + O(h^2) \quad\text{(central-difference formula).} \tag{6.3}$$

Similarly, we can go out one more term in (6.2) (assuming $f \in C^4[a, b]$).
Adding the two resulting expressions and dividing by $h^2$ then gives

$$f''(x_0) = \frac{1}{h^2}\left[f(x_0 - h) - 2f(x_0) + f(x_0 + h)\right] - \frac{h^2}{24}\left[f^{(4)}(\xi_1) + f^{(4)}(\xi_2)\right].$$

Hence, using the Intermediate Value Theorem,

$$f''(x_0) = \frac{f(x_0 - h) - 2f(x_0) + f(x_0 + h)}{h^2} - \frac{h^2}{12} f^{(4)}(\xi). \tag{6.4}$$


Example 6.1 
$f(x) = x \ln x$. Estimate $f''(2)$ using (6.4) with $h = 0.1$. Doing so, we obtain

$$f''(2) \approx \frac{f(2.1) - 2f(2) + f(1.9)}{(0.1)^2} \approx 0.5002.$$

(Notice that $f''(2) = 1/2$, so the approximation is accurate.)
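The difference quotients (6.1), (6.3), and (6.4) translate directly into code. The short sketch below (function names are chosen here for illustration) repeats the computation of Example 6.1 in double precision.

```python
import math

def forward_difference(f, x0, h):
    """Formula (6.1): first derivative, truncation error O(h)."""
    return (f(x0 + h) - f(x0)) / h

def central_difference(f, x0, h):
    """Formula (6.3): first derivative, truncation error O(h^2)."""
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

def second_difference(f, x0, h):
    """Formula (6.4): second derivative, truncation error O(h^2)."""
    return (f(x0 - h) - 2.0 * f(x0) + f(x0 + h)) / (h * h)

def f(x):
    return x * math.log(x)          # f'(x) = ln(x) + 1, f''(x) = 1/x

h = 0.1
d2 = second_difference(f, 2.0, h)   # Example 6.1; exact value is 0.5
d1_forward = forward_difference(f, 2.0, h)
d1_central = central_difference(f, 2.0, h)
print(d2, d1_forward, d1_central)
```

With the same $h$, the central difference is noticeably closer to $f'(2) = \ln 2 + 1$ than the forward difference, as the $O(h^2)$ versus $O(h)$ error terms suggest.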


6.1.2. Derivation via Lagrange Polynomial Representation 


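Now consider the Lagrange polynomial method for approximating derivatives.
Let $\{x_0, x_1, \dots, x_n\}$ be $n + 1$ distinct points on $[a, b]$ and assume that
$x_{j+1} - x_j = h$ (not necessary) for $j = 0, 1, 2, \dots, n - 1$. The Lagrange
interpolating polynomial to $f(x)$ of degree at most $n$ through the points
$(x_0, f(x_0)), (x_1, f(x_1)), \dots, (x_n, f(x_n))$ is

$$p(x) = \sum_{k=0}^n f(x_k) L_k(x), \quad\text{where } L_k(x) = \prod_{\substack{j=0 \\ j \ne k}}^n \frac{x - x_j}{x_k - x_j}, \tag{6.5}$$

and

$$f(x) = p(x) + \left[\prod_{j=0}^n (x - x_j)\right] \frac{f^{(n+1)}(\xi(x))}{(n+1)!} \tag{6.6}$$

for some $\xi(x) \in [a, b]$. Hence,

$$\begin{aligned} f'(x) &= \sum_{k=0}^n f(x_k) L_k'(x) + \frac{d}{dx}\left\{\left[\prod_{j=0}^n (x - x_j)\right] \frac{f^{(n+1)}(\xi(x))}{(n+1)!}\right\} \\ &= \sum_{k=0}^n f(x_k) L_k'(x_\ell) + \left[\prod_{\substack{j=0 \\ j \ne \ell}}^n (x_\ell - x_j)\right] \frac{f^{(n+1)}(\xi(x_\ell))}{(n+1)!} \quad\text{for } x = x_\ell. \end{aligned} \tag{6.7}$$

The error term

$$\left[\prod_{\substack{j=0 \\ j \ne \ell}}^n (x_\ell - x_j)\right] \frac{f^{(n+1)}(\xi(x_\ell))}{(n+1)!}$$

is generally small.

Consider, for example, $n = 2$ with $x_1 = x_0 + h$, $x_2 = x_1 + h$. Then,

$$\begin{aligned} f'(x_\ell) = {} & f(x_0)\left[\frac{2x_\ell - x_1 - x_2}{(x_0 - x_1)(x_0 - x_2)}\right] + f(x_1)\left[\frac{2x_\ell - x_0 - x_2}{(x_1 - x_0)(x_1 - x_2)}\right] \\ & + f(x_2)\left[\frac{2x_\ell - x_0 - x_1}{(x_2 - x_0)(x_2 - x_1)}\right] + \frac{1}{6} f'''(\xi_\ell) \prod_{\substack{j=0 \\ j \ne \ell}}^2 (x_\ell - x_j). \end{aligned}$$

Taking $x_\ell = x_1$ gives

$$\begin{aligned} f'(x_1) &= -\frac{1}{2h} f(x_0) + 0 \cdot f(x_1) + \frac{1}{2h} f(x_2) - \frac{h^2}{6} f'''(\xi_1) \\ &= \frac{1}{2h}\left[f(x_1 + h) - f(x_1 - h)\right] - \frac{1}{6} f'''(\xi_1) h^2 \quad\text{(central difference formula).} \end{aligned}$$

By considering 5 points ($n = 4$), derivative formulas of $O(h^4)$ can be derived.
In particular,

$$f'(x_0) = \frac{1}{12h}\left[f(x_0 - 2h) - 8f(x_0 - h) + 8f(x_0 + h) - f(x_0 + 2h)\right] + \frac{h^4}{30} f^{(5)}(\xi), \tag{6.8}$$

$$f'(x_0) = \frac{1}{12h}\left[-25 f(x_0) + 48 f(x_0 + h) - 36 f(x_0 + 2h) + 16 f(x_0 + 3h) - 3 f(x_0 + 4h)\right] + \frac{h^4}{5} f^{(5)}(\xi). \tag{6.9}$$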
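As a quick numerical check of the two five-point formulas, the following sketch evaluates both on $f(x) = \sin x$ at $x_0 = 1$. The test function, the point, and the step $h = 0.1$ are choices made here for illustration; they do not come from the text.

```python
import math

def five_point_central(f, x0, h):
    """Five-point formula centered at x0, truncation error O(h^4)."""
    return (f(x0 - 2 * h) - 8 * f(x0 - h)
            + 8 * f(x0 + h) - f(x0 + 2 * h)) / (12 * h)

def five_point_forward(f, x0, h):
    """Five-point one-sided formula using x0, ..., x0 + 4h, error O(h^4)."""
    return (-25 * f(x0) + 48 * f(x0 + h) - 36 * f(x0 + 2 * h)
            + 16 * f(x0 + 3 * h) - 3 * f(x0 + 4 * h)) / (12 * h)

x0, h = 1.0, 0.1
exact = math.cos(x0)                         # derivative of sin at x0
err_central = abs(five_point_central(math.sin, x0, h) - exact)
err_forward = abs(five_point_forward(math.sin, x0, h) - exact)
print(err_central, err_forward)
```

The one-sided formula is useful at the end of an interval, where points to the left of $x_0$ are not available, at the price of a larger error constant.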


6.1.3 Error Analysis 


One difficulty with numerical differentiation is that rounding error can be
large if $h$ is too small. In the computer, $\tilde{f}(x_0 + h) = f(x_0 + h) + e(x_0 + h)$
and $\tilde{f}(x_0) = f(x_0) + e(x_0)$, where $\tilde{f}$ denotes the computed value of $f$ and
$e(x_0 + h)$ and $e(x_0)$ are roundoff errors that
depend on the number of digits used by the computer.

Consider the forward-difference formula

$$f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f''(\xi).$$

We will assume that $|e(x)| \le \epsilon |f(x)|$ for some relative error $\epsilon$, that $|f(x)| \le M_0$
for some constant $M_0$, and that $|f''(x)| \le M_2$ for some constant $M_2$, for all
values of $x$ near $x_0$ that are being considered. Then, these assumed bounds
and repeated application of the triangle inequality give

$$\left| f'(x_0) - \frac{\tilde{f}(x_0 + h) - \tilde{f}(x_0)}{h} \right| \le \left| \frac{e(x_0 + h) - e(x_0)}{h} \right| + \frac{h}{2} M_2 \le \frac{2 \epsilon M_0}{h} + \frac{h M_2}{2} = E(h), \tag{6.10}$$

where $\epsilon$ is any number such that $|e(x)| \le \epsilon |f(x)|$ for all $x$ under consideration.
That is, the error is bounded by a curve such as that in Figure 6.1. Thus, if
the value of $h$ is too small, the error can be large.

FIGURE 6.1: Illustration of the total error (roundoff plus truncation)
bound in forward difference quotient approximation to $f'$.


Example 6.2 


(Using 10-digit precision) $f(x) = \ln x$, $f'(3) \approx \dfrac{\ln(3 + h) - \ln(3)}{h}$.

h        | (ln(3 + h) − ln(3))/h
10^{-1}  | 0.32790
10^{-2}  | 0.33278
10^{-3}  | 0.33328
10^{-4}  | 0.33332   best estimate
10^{-5}  | 0.33330
10^{-6}  | 0.33300
10^{-7}  | 0.33000
10^{-8}  | 0.30000
10^{-9}  | 0

To analyze this example, notice that $f''(x) = -\frac{1}{x^2}$ and $M_2 = \max |f''(\xi)| \approx \frac{1}{9}$.
Suppose that the error is $e(h) = \frac{2\epsilon}{h} M_0 + \frac{h}{2} M_2$. The minimum
error occurs at $e'(h) = 0$, which gives $h_{\mathrm{opt}} \approx \sqrt{36\epsilon}$ (taking $M_0 \approx 1$, since $\ln 3 \approx 1.1$). For ten significant digit
accuracy, $\epsilon \approx 5 \times 10^{-10}$. Thus, $h_{\mathrm{opt}} \approx 10^{-4}$.

If we use calculus to minimize the expression on the right of (6.10) with
respect to $h$, we obtain

$$h_{\mathrm{opt}} = \frac{2\sqrt{\epsilon M_0}}{\sqrt{M_2}},$$

with a minimal bound on the error of

$$E(h_{\mathrm{opt}}) = 2\sqrt{M_0 M_2}\sqrt{\epsilon}.$$

Although the right member of (6.10) is merely a bound, we see that $h_{\mathrm{opt}}$ gives
a good estimate for the optimal step in the divided difference, and $E(h_{\mathrm{opt}})$
gives a good estimate for the minimum achievable error. In particular, the
minimum achievable error is $O(\sqrt{\epsilon})$ and the optimal $h$ is also $O(\sqrt{\epsilon})$, both in
the estimates and in the numerical experiments in Example 6.2.

With higher-order formulas, we can obtain a smaller total error bound, at
the expense of additional complication. In particular, if the roundoff error is
$O(1/h)$ and the truncation error is $O(h^n)$, then the optimal $h$ is $O(\epsilon^{1/(n+1)})$
and the minimum achievable error bound is $O(\epsilon^{n/(n+1)})$.
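The bound (6.10) and its minimizer are easy to evaluate numerically. The sketch below uses the rough values from the analysis of Example 6.2 ($\epsilon = 5 \times 10^{-10}$, $M_0 \approx 1$, $M_2 \approx 1/9$); the function names are introduced here for illustration only.

```python
import math

def error_bound(h, eps, m0, m2):
    """Right-hand side E(h) of (6.10): roundoff part plus truncation part."""
    return 2.0 * eps * m0 / h + h * m2 / 2.0

def optimal_step(eps, m0, m2):
    """Minimizer of E(h), obtained from E'(h) = 0."""
    return 2.0 * math.sqrt(eps * m0 / m2)

eps, m0, m2 = 5e-10, 1.0, 1.0 / 9.0
h_opt = optimal_step(eps, m0, m2)
e_min = error_bound(h_opt, eps, m0, m2)
print(h_opt, e_min)

# Scan powers of ten, as in the table of Example 6.2, to see the bound
# decrease and then increase again as h shrinks.
for k in range(1, 10):
    h = 10.0 ** (-k)
    print(h, error_bound(h, eps, m0, m2))
```

The scan shows the characteristic shape of the bound: the truncation term dominates for large $h$, the roundoff term dominates for small $h$, and the smallest bound on this grid occurs at $h = 10^{-4}$, matching the "best estimate" row of the table.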


6.2 Automatic (Computational) Differentiation 


Numerical differentiation has been used extensively in the past, e.g., for
computing the derivative $f'$ for use in Newton's method.¹ Another example
of the use of such derivative formulas is in the construction of methods for
the solution of boundary value problems in differential equations, such as is
explained in §10.1.2, starting on page 540 below. However, as we have just
seen (in §6.1.3 above), roundoff error limits the accuracy of finite-difference
approximations to derivatives. Moreover, it may be difficult in practice to
determine a step size $h$ for which near-optimal accuracy can be attained.
This can cause significant problems, for example, in multivariate floating point
Newton methods.

¹ or the multidimensional analog, as described in §8.1 on page 439

For complicated functions, algebraic computation of the derivatives by hand 
is also impractical. One possible alternative is to compute the derivatives 
with symbolic manipulation systems such as Mathematica, Maple, or Reduce. 
These systems have facilities for output of the derivatives as statements in 
common compiled programming languages. However, such systems are often 
not able to adequately simplify the expressions for the derivatives, resulting in 
expressions for derivatives that can be many times as long as the expressions 
for the function itself. This “expression swell” not only can result in inefficient 
evaluation, but also can cause roundoff error to be a problem, even though 
there is no truncation error. 

A third alternative is automatic differentiation, also called “computational 
differentiation.” In this scheme, there is no truncation (method) error and the 
expression for the function is not symbolically manipulated, yet the user only 
need supply the expression for the function itself. The technique, increasingly 
used during the two decades prior to composition of this book, is based upon 
defining an arithmetic on composite objects, the components of which repre- 
sent function and derivative values. The rules of this arithmetic are based on 
the elementary rules of differentiation learned in calculus, in particular, on 
the chain rule. 


6.2.1 The Forward Mode 


In the "forward mode" of automatic differentiation, the derivative or derivatives
are computed at the same time as the function. For example, if the function
and the first $k$ derivatives are desired, then the arithmetic will operate
on objects of the form

$$u_\nabla = \left(u, u', u'', \dots, u^{(k)}\right). \tag{6.11}$$

Addition of such objects comes from the calculus rule "the derivative of a sum
is the sum of the derivatives," that is,

$$u_\nabla + v_\nabla = \left(u + v, u' + v', u'' + v'', \dots, u^{(k)} + v^{(k)}\right). \tag{6.12}$$

In other words, the $j$-th component of $u_\nabla + v_\nabla$ is the $j$-th component of $u_\nabla$
plus the $j$-th component of $v_\nabla$, for $1 \le j \le k$. Subtraction is defined similarly,
while products $u_\nabla v_\nabla$ are defined such that the first component of $u_\nabla v_\nabla$ is
the first component of $u_\nabla$ times the first component of $v_\nabla$, etc., as follows:

$$u_\nabla v_\nabla = \left(uv,\ u'v + uv',\ u''v + 2u'v' + uv'',\ \dots,\ \sum_{j=0}^{k} \binom{k}{j} u^{(k-j)} v^{(j)}\right). \tag{6.13}$$

Rules for applying functions such as "exp," "sin," and "cos" to such objects
are similarly defined. For example,

$$\sin(u_\nabla) = \left(\sin(u),\ u' \cos(u),\ -\sin(u)(u')^2 + \cos(u) u'',\ \dots\right). \tag{6.14}$$

The differentiation object corresponding to a particular value $a$ of the independent
variable $x$ is of the form

$$x_\nabla = (a, 1, 0, \dots, 0).$$


Example 6.3 
Suppose the context requires us to have values of the function, of the first
derivative, and of the second derivative for the function

$$f(x) = x \sin(x) - 1,$$

where we want function and derivative values at $x = \pi/4$. What steps would
the computer do to complete the automatic differentiation?

The computer would first resolve $f$ into a sequence of operations (sometimes
called a code list, tape, or decomposition into elementary operations).
If we associate the independent variable $x$ with the variable $v_1$, and the $i$-th
intermediate result with $v_{i+1}$, a sequence of operations for $f$ can be²

$$\begin{aligned} v_{\nabla 2} &\leftarrow \sin(v_{\nabla 1}) \\ v_{\nabla 3} &\leftarrow v_{\nabla 1} v_{\nabla 2} \\ v_{\nabla 4} &\leftarrow v_{\nabla 3} - 1 \end{aligned} \tag{6.15}$$

We now illustrate with 4-digit decimal arithmetic, with rounding to nearest.
We first set

$$v_{\nabla 1} \leftarrow (\pi/4, 1, 0) \approx (0.7854, 1, 0).$$

Second, we use (6.14) to obtain

$$\begin{aligned} v_{\nabla 2} &\leftarrow \sin((0.7854, 1, 0)), \\ \text{i.e.}\quad & (\sin(0.7854),\ 1 \times \cos(0.7854),\ -\sin(0.7854) \times (1^2) + \cos(0.7854) \times 0) \\ &\approx (0.7071, 0.7071, -0.7071). \end{aligned}$$

Third, we use (6.13) to obtain

$$\begin{aligned} v_{\nabla 3} &\leftarrow (0.7854, 1, 0)(0.7071, 0.7071, -0.7071), \\ \text{i.e.}\quad & (0.7854 \times 0.7071,\ 1 \times 0.7071 + 0.7854 \times 0.7071, \\ & \ \ 0 \times 0.7071 + 2 \times 1 \times 0.7071 + 0.7854 \times (-0.7071)) \\ &\approx (0.5554, 1.263, 0.8589). \end{aligned}$$

Finally, the second derivative object corresponding to the constant 1 is $(1, 0, 0)$,
so we apply formula (6.12) to obtain

$$v_{\nabla 4} \leftarrow (0.5554, 1.263, 0.8589) - (1, 0, 0) \approx (-0.4446, 1.263, 0.8589).$$

Comparing, we have

$$\begin{aligned} f(\pi/4) &= (\pi/4)\sin(\pi/4) - 1 \approx -0.4446, \\ f'(x) &= x\cos(x) + \sin(x) \quad\text{so } f'(\pi/4) \approx 1.262, \\ f''(x) &= -x\sin(x) + 2\cos(x) \quad\text{so } f''(\pi/4) \approx 0.8589, \end{aligned}$$

where the above values were computed to 16 digits, then rounded to four
digits. This illustrates the validity of automatic differentiation.³

² We say "a sequence of operations for f can be," rather than "the sequence of operations
for f is," because, in general, decompositions for a particular expression are not unique.
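The arithmetic of Example 6.3 can be sketched with operator overloading. The class below is a minimal illustration for $k = 2$ only, following rules (6.12)–(6.14); the names `Jet2` and `jet_sin` are hypothetical and chosen here, and the computation runs in double precision rather than the four-digit arithmetic of the example.

```python
import math

class Jet2:
    """Differentiation object (u, u', u'') for k = 2."""

    def __init__(self, u, du=0.0, ddu=0.0):
        self.u, self.du, self.ddu = u, du, ddu

    @staticmethod
    def lift(value):
        # A constant c corresponds to the object (c, 0, 0).
        return value if isinstance(value, Jet2) else Jet2(float(value))

    def __add__(self, other):                       # rule (6.12)
        o = Jet2.lift(other)
        return Jet2(self.u + o.u, self.du + o.du, self.ddu + o.ddu)

    def __sub__(self, other):                       # subtraction, defined similarly
        o = Jet2.lift(other)
        return Jet2(self.u - o.u, self.du - o.du, self.ddu - o.ddu)

    def __mul__(self, other):                       # rule (6.13) with k = 2
        o = Jet2.lift(other)
        return Jet2(self.u * o.u,
                    self.du * o.u + self.u * o.du,
                    self.ddu * o.u + 2.0 * self.du * o.du + self.u * o.ddu)

def jet_sin(x):                                     # rule (6.14)
    s, c = math.sin(x.u), math.cos(x.u)
    return Jet2(s, x.du * c, -s * x.du ** 2 + c * x.ddu)

def f(x):
    return x * jet_sin(x) - 1                       # code list (6.15)

x = Jet2(math.pi / 4, 1.0, 0.0)                     # object for the independent variable
y = f(x)
print(y.u, y.du, y.ddu)
```

The user writes only the expression for `f`; the derivative values are carried along by the overloaded operations, with no difference quotients and no symbolic manipulation.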


6.2.2. The Reverse Mode 


The reverse mode of automatic differentiation, when used to compute the 
gradient of a function f of n variables, can be more efficient than the forward 
mode. In particular, when the forward mode (or for that matter, when finite 
differences or when symbolic derivatives) is used, the number of operations 
required to compute the gradient is proportional to n times the number of 
operations to compute the function. In contrast, when the reverse mode is 
used, it can be proven that the number of operations required to compute
the gradient $\nabla f$ (which has $n$ components) is bounded by 5 times the number
of operations required to evaluate $f$ itself, regardless of $n$. (However, a
quantity of numbers proportional to the number of operations required to 
evaluate f needs to be stored when the reverse mode is used.) 

So, how does the reverse mode work? We can think of the reverse mode
as forming a system of equations relating the derivatives of the intermediate
variables in the computation through the chain rule, then solving the system
of equations for the derivative of the independent variable. Suppose we have
a code list such as (6.15), giving the sequence of instructions for evaluating a
function $f$. For example, one such operation could be

$$v_p = v_q + v_r,$$

where $v_p$ is the value to be computed, while $v_q$ and $v_r$ have previously been
computed. Then, computing $f'$ is equivalent to computing $v_M'$, where $v_M$
corresponds to the value of $f$. (That is, $v_M$ is the dependent variable, generally
the result of the last operation in the computation of the expression for $f$.)

³ The discrepancy between the values 1.263 and 1.262 for $f'(\pi/4)$ is due to the fact that
rounding to four digits was done after each operation in the automatic differentiation. If the
expression for $f'$ were first symbolically derived, then evaluated with four digit rounding
(rather than exactly, then rounding), then a similar error would occur.
We form a sparse linear system with an equation for each operation in the
code list, whose variables are $v_k'$, $1 \le k \le M$. For example, the equation
corresponding to an addition $v_p = v_q + v_r$ would be

$$v_q' + v_r' - v_p' = 0,$$

while the equation corresponding to a product $v_p = v_q v_r$ would be

$$v_r v_q' + v_q v_r' - v_p' = 0,$$

where the values of the intermediate quantities $v_q$ and $v_r$ have been previously
computed and stored from an evaluation of $f$. Likewise, if the operation were
$v_p = \sin(v_q)$, then the equation would be

$$\cos(v_q) v_q' - v_p' = 0,$$

while if the operation were addition of a constant, $v_p = v_q + c$, then the
equation would be

$$v_q' - v_p' = 0.$$

If there is a single independent variable and the derivative is with respect
to this variable, then the first equation would be

$$v_1' = 1.$$

We illustrate with the $f$ for Example 6.3. If the code list is as in (6.15),
then the system of equations will be

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ \cos(v_1) & -1 & 0 & 0 \\ v_2 & v_1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix} \begin{pmatrix} v_1' \\ v_2' \\ v_3' \\ v_4' \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}. \tag{6.16}$$

If $v_1 = x = \pi/4$ as in Example 6.3, then this system, filled using four-digit
arithmetic, is

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0.7071 & -1 & 0 & 0 \\ 0.7071 & 0.7854 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix} \begin{pmatrix} v_1' \\ v_2' \\ v_3' \\ v_4' \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}. \tag{6.17}$$

The reverse mode consists simply of solving this system with forward substitution.
This system has solution

$$\begin{pmatrix} v_1' \\ v_2' \\ v_3' \\ v_4' \end{pmatrix} \approx \begin{pmatrix} 1.0000 \\ 0.7071 \\ 1.2625 \\ 1.2625 \end{pmatrix}.$$

Thus $f'(\pi/4) = v_4' \approx 1.2625$, which corresponds to what we obtained with
the forward mode.
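The construction of a system such as (6.16) from a code list can be sketched as follows. The tuple encoding of the code list (`"var"`, `"sin"`, `"mul"`, `"add_const"`) and the function names are assumptions made for this illustration, not a format used by any particular package; the triangular solve follows the text's description.

```python
import math

def build_system(code_list, x):
    """Evaluate the code list at x and build one equation row per operation."""
    values, rows = [], []
    size = len(code_list)
    for k, op in enumerate(code_list):
        row = [0.0] * size
        kind = op[0]
        if kind == "var":                    # independent variable: v_k' = 1
            values.append(x)
            row[k] = 1.0
        else:
            row[k] = -1.0                    # the "- v_p'" term
            if kind == "sin":                # cos(v_q) v_q' - v_p' = 0
                q = op[1]
                values.append(math.sin(values[q]))
                row[q] += math.cos(values[q])
            elif kind == "mul":              # v_r v_q' + v_q v_r' - v_p' = 0
                q, r = op[1], op[2]
                values.append(values[q] * values[r])
                row[q] += values[r]
                row[r] += values[q]
            elif kind == "add_const":        # v_q' - v_p' = 0
                q, c = op[1], op[2]
                values.append(values[q] + c)
                row[q] += 1.0
            else:
                raise ValueError("unknown operation: " + kind)
        rows.append(row)
    rhs = [1.0 if op[0] == "var" else 0.0 for op in code_list]
    return values, rows, rhs

def forward_substitution(lower, b):
    """Solve L y = b for a lower-triangular L with nonzero diagonal."""
    y = []
    for i, row in enumerate(lower):
        partial = sum(row[j] * y[j] for j in range(i))
        y.append((b[i] - partial) / row[i])
    return y

# Code list (6.15) for f(x) = x sin(x) - 1; indices are zero-based here.
code_list = [("var",), ("sin", 0), ("mul", 0, 1), ("add_const", 2, -1.0)]
values, rows, rhs = build_system(code_list, math.pi / 4)
derivs = forward_substitution(rows, rhs)
print(values[-1], derivs[-1])
```

The stored `values` play the role of the intermediate quantities that, as noted above, must be kept from the evaluation of $f$ before the derivative system can be filled in.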


Example 6.4 
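Suppose $f(x_1, x_2) = x_1^2 - x_2^2$. Compute

$$\nabla f(x_1, x_2) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right)^T$$

at $(x_1, x_2) = (1, 2)$ using the reverse mode.
Solution: A code list for this function can be

$$\begin{aligned} v_1 &= x_1 \\ v_2 &= x_2 \\ v_3 &= v_1^2 \\ v_4 &= v_2^2 \\ v_5 &= v_3 - v_4 \end{aligned}$$

The reverse mode system of equations for computing $\partial f/\partial x_i$ is thus

$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 2v_1 & 0 & -1 & 0 & 0 \\ 0 & 2v_2 & 0 & -1 & 0 \\ 0 & 0 & 1 & -1 & -1 \end{pmatrix} \begin{pmatrix} v_1' \\ v_2' \\ v_3' \\ v_4' \\ v_5' \end{pmatrix} = e_i, \tag{6.18}$$

where $e_i$ is the vector whose $i$-th component is 1 and all of whose other
components are 0. When $x_1 = 1$ and $x_2 = 2$, we have

$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 2 & 0 & -1 & 0 & 0 \\ 0 & 4 & 0 & -1 & 0 \\ 0 & 0 & 1 & -1 & -1 \end{pmatrix} \begin{pmatrix} v_1' \\ v_2' \\ v_3' \\ v_4' \\ v_5' \end{pmatrix} = e_i.$$

Now, $\partial f/\partial x_1$ can be computed by ignoring the row and column corresponding
to $v_2'$, while $\partial f/\partial x_2$ can be computed by ignoring the row and column
corresponding to $v_1'$. We thus obtain $\partial f/\partial x_1 = 2$ and $\partial f/\partial x_2 = -4$ (Exercise 9).

In fact, a directional derivative can be computed in the reverse mode with
the same amount of work it takes to compute a single partial derivative. For
example, the directional derivative of $f(x_1, x_2)$ at $(x_1, x_2) = (1, 2)$ in the
direction of $u = (1/\sqrt{2}, 1/\sqrt{2})^T$ can be obtained by solving the linear system

$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 2 & 0 & -1 & 0 & 0 \\ 0 & 4 & 0 & -1 & 0 \\ 0 & 0 & 1 & -1 & -1 \end{pmatrix} \begin{pmatrix} v_1' \\ v_2' \\ v_3' \\ v_4' \\ v_5' \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \\ 0 \\ 0 \end{pmatrix} \tag{6.19}$$

for $v_5'$.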
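The three solves in Example 6.4 can be checked with a few lines of Python. This sketch solves the full $5 \times 5$ system for each right-hand side (rather than dropping a row and column, as the text suggests for the partial derivatives); the helper names are introduced here.

```python
import math

def forward_substitution(lower, b):
    """Solve L y = b for a lower-triangular L with nonzero diagonal."""
    y = []
    for i, row in enumerate(lower):
        partial = sum(row[j] * y[j] for j in range(i))
        y.append((b[i] - partial) / row[i])
    return y

x1, x2 = 1.0, 2.0
system = [
    [1.0,      0.0,      0.0,  0.0,  0.0],   # v1' = seed for x1
    [0.0,      1.0,      0.0,  0.0,  0.0],   # v2' = seed for x2
    [2.0 * x1, 0.0,     -1.0,  0.0,  0.0],   # v3 = v1^2
    [0.0,      2.0 * x2, 0.0, -1.0,  0.0],   # v4 = v2^2
    [0.0,      0.0,      1.0, -1.0, -1.0],   # v5 = v3 - v4
]

def derivative_of_output(seed):
    """Return v5' for the given first two right-hand-side entries."""
    rhs = [seed[0], seed[1], 0.0, 0.0, 0.0]
    return forward_substitution(system, rhs)[-1]

df_dx1 = derivative_of_output((1.0, 0.0))            # right-hand side e_1
df_dx2 = derivative_of_output((0.0, 1.0))            # right-hand side e_2
u = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
directional = derivative_of_output(u)                # right-hand side of (6.19)
print(df_dx1, df_dx2, directional)
```

The directional derivative agrees with $\nabla f \cdot u = (2 - 4)/\sqrt{2}$, and it costs one triangular solve, the same as a single partial derivative.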


6.2.3. Implementation of Automatic Differentiation 


Automatic differentiation can be incorporated directly into the program- 
ming language compiler, or the technology of operator overloading (available 
in object-oriented languages) can be used. A number of packages are available 
to do automatic differentiation. The best packages (such as ADOLC, for differ- 
entiating “C” programs and ADIFOR, for differentiating Fortran programs) can 
accept the definition of the function f in the form of a fairly generally written 
computer program. Some of them (such as ADIFOR) produce a new program 
that will evaluate both the function and derivatives, while others (such as 
ADOLC) produce a code list or “tape” from the original program, then operate 
on the code list to produce the derivatives. The monograph [36] contains a 
comprehensive overview of theory and implementation of both the forward 
and backward modes. 


6.3. Numerical Integration 


The problem throughout the remainder of this chapter is determining accurate
methods for approximating the integral $\int_a^b f(x)\,dx$. Approximating
integrals is called numerical integration or quadrature.


6.3.1 Introduction 


First, we define some useful notation. Let

$$I(f) = \int_a^b f(x)\,dx. \tag{6.20}$$

We will see that most quadrature formulas reduce to the form

$$Q(f) = (b - a) \sum_{j=0}^m \alpha_j f(x_j), \tag{6.21}$$

where the $\alpha_0, \alpha_1, \dots, \alpha_m$ are called weights and the $x_0, x_1, \dots, x_m$ are the
sample or nodal points. We have

$$I(f) = Q(f) + E(f), \tag{6.22}$$

where $E(f)$ is the error in the quadrature formula.

REMARK 6.1 To obtain $E(f) = 0$ for $f(x) =$ a constant, we require

$$\sum_{j=0}^m \alpha_j = 1.$$


In the next few sections, we will study quadrature formulas for which the
error $E(f)$ is proportional to $H^{q(m)}$, where $q(m) > 1$ depends on the weights
and sample points and $H = b - a$. For fixed $m$, which is generally desirable
since the weights and points may be difficult to compute or the weights
may become large in magnitude,⁴ we need a way to make $E(f)$ go to zero. This
is accomplished by dividing the interval $[a, b]$ into many smaller subintervals.
This procedure is called composite numerical integration.

Consider Figure 6.2. We divide $[a, b]$ into $N$ subintervals each of length $h$.

FIGURE 6.2: Illustration of composite numerical integration. (The figure marks
the points $a = a_0, a_1, a_2, \dots, a_{N-1}, a_N = b$.)

⁴ leading to rounding errors

Thus, $h = (b - a)/N$ and $a_\nu = a + \nu h$ for $\nu = 0, 1, \dots, N$. Then,

$$I(f) = \sum_{\nu=0}^{N-1} \int_{a_\nu}^{a_{\nu+1}} f(x)\,dx = \sum_{\nu=0}^{N-1} Q_\nu(f) + \sum_{\nu=0}^{N-1} E_\nu(f), \tag{6.23}$$

where $Q_\nu(f) = (a_{\nu+1} - a_\nu) \sum_{j=0}^m \alpha_j f(x_{\nu j})$ and $E_\nu(f) = O\left(h^{q(m)}\right)$.
Thus,

$$\sum_{\nu=0}^{N-1} Q_\nu(f) = I(f) + O\left(N h^{q(m)}\right) = I(f) + O\left(h^{q(m)-1}\right) \to I(f) \quad\text{as } h \to 0.$$


REMARK 6.2 In practice, the points and weights, $x_j$, $\alpha_j$, $j = 0, 1, \dots, m$,
are generally given for some interval, say $[-1, 1]$. Then

$$\int_{-1}^{1} f(x)\,dx \approx 2 \sum_{j=0}^m \alpha_j f(x_j).$$

However, given these points, the points for any interval $[a_\nu, a_{\nu+1}]$ can be
determined, and hence

$$\int_{a_\nu}^{a_{\nu+1}} f(x)\,dx$$

can be calculated. To see how, first note that

$$\int_{a_\nu}^{a_{\nu+1}} f(x)\,dx = \int_{-1}^{1} f\left(\frac{zh + a_{\nu+1} + a_\nu}{2}\right) \frac{h}{2}\,dz,$$

where

$$x = \frac{zh + a_{\nu+1} + a_\nu}{2} \quad\text{and}\quad z = \frac{2x - a_{\nu+1} - a_\nu}{h}.$$

Thus,

$$\int_{a_\nu}^{a_{\nu+1}} f(x)\,dx \approx (a_{\nu+1} - a_\nu) \sum_{j=0}^m f\left(\frac{x_j h + a_{\nu+1} + a_\nu}{2}\right) \alpha_j = h \sum_{j=0}^m f\left(\frac{h x_j + 2a + (2\nu + 1)h}{2}\right) \alpha_j.$$

Therefore,

$$I(f) = \sum_{\nu=0}^{N-1} h \sum_{j=0}^m \alpha_j f(x_{\nu j}) + O\left(h^{q(m)-1}\right), \tag{6.24}$$

where $x_{\nu j} = (h x_j + 2a + (2\nu + 1)h)/2$ for $0 \le j \le m$, $0 \le \nu \le N - 1$.

In the next two sections, we will study two popular ways of determining
accurate points and weights, $x_j$ and $\alpha_j$ for $j = 0, 1, \dots, m$, to be used in
formulas (6.21) and (6.24).
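Formula (6.24) amounts to a double loop over subintervals and nodes. The sketch below is illustrative only: the function name `composite_rule` is introduced here, and the sample nodes and weights are the closed four-point Newton–Cotes values ($1/8$, $3/8$, $3/8$, $1/8$) transplanted to $[-1, 1]$, which is an assumption about how a table on $[-1, 1]$ would list them.

```python
import math

def composite_rule(f, a, b, n_sub, nodes, weights):
    """Composite quadrature (6.24).

    nodes   : sample points x_j given on [-1, 1]
    weights : alpha_j, normalized so that they sum to 1
    """
    h = (b - a) / n_sub
    total = 0.0
    for nu in range(n_sub):
        for xj, alpha in zip(nodes, weights):
            # Map the reference node to the nu-th subinterval.
            x = (h * xj + 2.0 * a + (2 * nu + 1) * h) / 2.0
            total += alpha * f(x)
    return h * total

nodes = [-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0]
weights = [1.0 / 8.0, 3.0 / 8.0, 3.0 / 8.0, 1.0 / 8.0]

exact = math.e - 1.0                                   # integral of exp over [0, 1]
one_panel = composite_rule(math.exp, 0.0, 1.0, 1, nodes, weights)
eight_panels = composite_rule(math.exp, 0.0, 1.0, 8, nodes, weights)
print(one_panel - exact, eight_panels - exact)
```

This simple version re-evaluates $f$ at shared endpoints of adjacent subintervals; a production code for a closed rule would typically reuse those values.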


6.3.2. Newton-Cotes Formulas 


Consider

$$I(f) = \int_a^b f(x)\,dx \approx Q(f) = (b - a) \sum_{j=0}^m \alpha_j f(x_j).$$

We now study a standard procedure for determining the $\alpha_j$'s and $x_j$'s to
obtain approximations called the Newton–Cotes Formulas. (Of course, these
weights and points are usually implemented in the composite formula (6.24).)
There are two ways to derive the Newton–Cotes Formulas; each is useful to
understand. For either derivation, the points $x_j$, $j = 0, 1, \dots, m$ are taken
to be equally spaced on the interval $[a, b]$. There are two ways to accomplish
this.

DEFINITION 6.1 The ($m + 1$ point) open Newton–Cotes formulas have
points $x_j = x_0 + jh$, $j = 0, 1, 2, \dots, m$, where $h = (b - a)/(m + 2)$ and
$x_0 = a + h$. The ($m + 1$ point) closed Newton–Cotes formulas have points
$x_j = x_0 + jh$, $j = 0, 1, \dots, m$, where $h = (b - a)/m$ and $x_0 = a$.

REMARK 6.3 The $h$ in Definition 6.1 is different from the $h$ used in the
composite numerical integration method.

Example 6.5

Suppose that $a = 0$, $b = 1$, and $m = 3$. Then, the open points are $x_0 = 0.2$,
$x_1 = 0.4$, $x_2 = 0.6$, and $x_3 = 0.8$, while the closed points are $x_0 = 0$, $x_1 = 1/3$,
$x_2 = 2/3$, and $x_3 = 1$.


We now consider two ways for deriving the weights a;. 


6.3.2.1 Derivation 1 (Require (6.21) to be exact for polynomials of 
degree < m) 


We require 


J(p) = Q(p) for polynomials p(x) up to degree m. (6.25) 


Numerical Differentiation and Integration 337 


Consider 
E(f) = J(N) -Q(N) 
= J(f) — J(p) + Q(p) — Q(f) for any p € Pm 
=IG <P) +CG= 1) 
b m 
: / (F(x) — p(x) de + (b— a) S04 (p(s) — f(as)). 
a j=0 
Thus, 
E(F)| < ((b— a) + (b— a) Y> Jay) max, |F(2) — pl) 
j=0 lick 


for any p € P,,. Hence, if J(p) = Q(p) and f is smooth, our polynomial 
approximation theory results? indicate that the error |E(f)| may be quite 
small. 

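We now have the question: Does condition (6.25) uniquely determine the
weights $\alpha_j$? Consider first the closed Newton-Cotes formulas for $m = 3$ with
$a = 0$ and $b = 1$.

Then,

\[ Q(f) = \alpha_0 f(0) + \alpha_1 f(1/3) + \alpha_2 f(2/3) + \alpha_3 f(1) \]

is exact for $f(x) = 1, x, x^2, x^3$. This produces the linear system

\[
\begin{aligned}
\int_0^1 1\,dx   &= 1 = \alpha_0 + \alpha_1 + \alpha_2 + \alpha_3, \\
\int_0^1 x\,dx   &= \tfrac12 = \alpha_0 (0) + \alpha_1 \left(\tfrac13\right) + \alpha_2 \left(\tfrac23\right) + \alpha_3, \\
\int_0^1 x^2\,dx &= \tfrac13 = \alpha_0 (0) + \alpha_1 \left(\tfrac13\right)^2 + \alpha_2 \left(\tfrac23\right)^2 + \alpha_3, \\
\int_0^1 x^3\,dx &= \tfrac14 = \alpha_0 (0) + \alpha_1 \left(\tfrac13\right)^3 + \alpha_2 \left(\tfrac23\right)^3 + \alpha_3.
\end{aligned}
\]

Solving, we obtain $\alpha_0 = 1/8$, $\alpha_1 = 3/8$, $\alpha_2 = 3/8$, $\alpha_3 = 1/8$.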
In general, we obtain the linear system $A\alpha = b$ where

\[
A = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
x_0 & x_1 & \cdots & x_m \\
x_0^2 & x_1^2 & \cdots & x_m^2 \\
\vdots & \vdots & & \vdots \\
x_0^m & x_1^m & \cdots & x_m^m
\end{pmatrix},
\qquad
\alpha = \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \vdots \\ \alpha_m \end{pmatrix},
\qquad\text{and}\qquad
b = \begin{pmatrix} 1 \\ 1/2 \\ 1/3 \\ \vdots \\ 1/(m+1) \end{pmatrix}
\]

for the interval $[0, 1]$. Note that $A$ is the transpose of the Vandermonde matrix
that arises as the matrix for the system of equations for the coefficients of the
interpolating polynomial at the points $x_j$. (See page 211.) Thus, the matrix
$A$ is nonsingular, because $x_j \ne x_k$ for $j \ne k$.

Footnote 5: e.g., the error terms for Lagrange interpolation or best uniform approximation.
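
As an illustration of Derivation 1, the system $A\alpha = b$ can be assembled and solved exactly for small $m$. The sketch below assumes the closed points $x_j = j/m$ on $[0,1]$ and uses Python's `fractions` module to avoid rounding; the helper names `solve_linear_system` and `closed_newton_cotes_weights` are hypothetical and not from the text.

```python
from fractions import Fraction


def solve_linear_system(A, b):
    """Gauss-Jordan elimination in exact Fraction arithmetic."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # A is nonsingular, so a nonzero pivot always exists.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def closed_newton_cotes_weights(m):
    """Weights alpha_j on [0, 1] from exactness for 1, x, ..., x^m."""
    x = [Fraction(j, m) for j in range(m + 1)]
    A = [[xj ** k for xj in x] for k in range(m + 1)]  # row k: x_0^k ... x_m^k
    b = [Fraction(1, k + 1) for k in range(m + 1)]     # integral of x^k on [0, 1]
    return solve_linear_system(A, b)


weights = closed_newton_cotes_weights(3)
```

For $m = 3$ this reproduces the weights $1/8, 3/8, 3/8, 1/8$ found above. For large $m$ the Vandermonde-type system becomes ill-conditioned in floating point, which is one reason exact arithmetic is used in this sketch.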


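6.3.2.2 Derivation 2 (Require (6.21) to be exact for the Lagrange
interpolant to $f(x)$)


Recall Lagrange polynomial interpolation

\[ f(x) = p(x) + R(x), \]

where

\[ p(x) = \sum_{j=0}^{m} f(x_j) L_j(x) \quad\text{and}\quad R(x) = \frac{f^{(m+1)}(\xi(x))}{(m+1)!} \prod_{i=0}^{m} (x - x_i). \]

Recall that $f(x_i) = p(x_i)$ for $i = 0, 1, \dots, m$, where the $x_i$ are either the open
or closed points. Consider

\[ \int_a^b f(x)\,dx = \int_a^b p(x)\,dx + \int_a^b R(x)\,dx = \sum_{j=0}^{m} f(x_j) \int_a^b L_j(x)\,dx + \int_a^b R(x)\,dx. \]

Define

\[ Q(f) = (b-a)\sum_{j=0}^{m} \alpha_j f(x_j), \quad\text{where}\quad \alpha_j = \frac{1}{b-a}\int_a^b L_j(x)\,dx \quad\text{and}\quad E(f) = \int_a^b R(x)\,dx. \]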


Example 6.6
Consider the closed Newton-Cotes formula on $[0,1]$ for $m = 3$. In this case,
$x_0 = 0$, $x_1 = 1/3$, $x_2 = 2/3$, $x_3 = 1$. Then,

\[
\alpha_0 = \int_0^1 L_0(x)\,dx = \int_0^1 \frac{(x - x_1)(x - x_2)(x - x_3)}{(x_0 - x_1)(x_0 - x_2)(x_0 - x_3)}\,dx
= -\frac{9}{2}\int_0^1 \left(x - \tfrac13\right)\left(x - \tfrac23\right)(x - 1)\,dx = \frac18.
\]

Similarly, $\alpha_1 = 3/8$, $\alpha_2 = 3/8$, $\alpha_3 = 1/8$. Thus

\[ \int_0^1 f(x)\,dx \approx \frac18 f(0) + \frac38 f\left(\tfrac13\right) + \frac38 f\left(\tfrac23\right) + \frac18 f(1). \]

As an example of application of this formula, consider

\[ \int_0^1 e^x\,dx = e - 1 \approx \frac18 e^0 + \frac38 e^{1/3} + \frac38 e^{2/3} + \frac18 e^1 \approx 1.71854. \]

(Of course, composite numerical integration can be used with this 4-point
Newton-Cotes method to improve the approximation.)
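
The arithmetic in Example 6.6 can be reproduced in a few lines. This is a minimal sketch; the function name `three_eighths_rule` is an illustrative label for the 4-point closed rule with weights $1/8, 3/8, 3/8, 1/8$, rescaled to a general interval $[a,b]$.

```python
import math


def three_eighths_rule(f, a, b):
    """4-point closed Newton-Cotes rule (m = 3) on [a, b]."""
    h = (b - a) / 3
    return (b - a) * (f(a) + 3 * f(a + h) + 3 * f(a + 2 * h) + f(b)) / 8


approx = three_eighths_rule(math.exp, 0.0, 1.0)
exact = math.e - 1.0
error = exact - approx
```

Here `approx` agrees with the value 1.71854 quoted in the example, and `error` is of size roughly $3 \times 10^{-4}$.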


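6.3.2.3 The Error Term


We now consider $E(f)$ in greater detail. Recall that

\[ E(f) = \int_a^b R(x)\,dx = \int_a^b \frac{f^{(m+1)}(\xi(x))}{(m+1)!} \prod_{i=0}^{m} (x - x_i)\,dx. \]

Case 1: $\prod_{i=0}^{m} (x - x_i)$ does or does not change sign on $[a, b]$

Then,

\[
|E(f)| = \left| \int_a^b \frac{f^{(m+1)}(\xi(x))}{(m+1)!} \prod_{i=0}^{m} (x - x_i)\,dx \right|
\le \frac{\|f^{(m+1)}\|_\infty}{(m+1)!} \int_a^b \Big| \prod_{i=0}^{m} (x - x_i) \Big|\,dx,
\]

where

\[ \|f^{(m+1)}\|_\infty = \max_{a \le x \le b} |f^{(m+1)}(x)|. \]

Changing variables,

\[ z = \frac{x - a}{b - a}, \qquad z_i = \frac{x_i - a}{b - a}, \qquad (x - x_i) = (z - z_i)(b - a) \]

gives

\[ |E(f)| \le \frac{\|f^{(m+1)}\|_\infty}{(m+1)!} (b - a)^{m+2} \int_0^1 \Big| \prod_{i=0}^{m} (z - z_i) \Big|\,dz. \]

Thus,

\[ |E(f)| \le H^{m+2} \frac{\|f^{(m+1)}\|_\infty}{(m+1)!} \hat{B}_{m+1}, \tag{6.26} \]

where

\[ H = (b - a) \quad\text{and}\quad \hat{B}_{m+1} = \int_0^1 \Big| \prod_{i=0}^{m} (z - z_i) \Big|\,dz. \]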




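Case 2: $\prod_{i=0}^{m} (x - x_i)$ does not change sign on $[a, b]$ (restrictive)


Then,

\[ E(f) = \frac{f^{(m+1)}(\xi')}{(m+1)!} \int_a^b \prod_{i=0}^{m} (x - x_i)\,dx \]

for some $\xi'$, $a < \xi' < b$, by the weighted mean-value theorem for integrals.
Thus,

\[ E(f) = \frac{f^{(m+1)}(\xi')}{(m+1)!} (b - a)^{m+2} \int_0^1 \prod_{i=0}^{m} (z - z_i)\,dz, \]

that is,

\[ E(f) = \frac{f^{(m+1)}(\xi')}{(m+1)!} H^{m+2} B_{m+1}, \tag{6.27} \]

where

\[ B_{m+1} = \int_0^1 \prod_{i=0}^{m} (z - z_i)\,dz. \]
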
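Example 6.7


Consider $m = 1$, $x_0 = a$, $x_1 = b$, and closed Newton-Cotes. Then

\[
\alpha_0 = \frac{1}{b-a}\int_a^b L_0(x)\,dx = \frac{1}{b-a}\int_a^b \frac{x - b}{a - b}\,dx = \frac12,
\qquad
\alpha_1 = \frac{1}{b-a}\int_a^b L_1(x)\,dx = \frac{1}{b-a}\int_a^b \frac{x - a}{b - a}\,dx = \frac12.
\]

Thus,

\[ Q(f) = \frac{b-a}{2}\big[ f(a) + f(b) \big] \]

(the trapezoidal rule). Also,

\[ E(f) = \int_a^b \frac{f''(\xi(x))}{2}(x - a)(x - b)\,dx = \frac{f''(\xi)}{2} \int_a^b (x - a)(x - b)\,dx, \]

since $(x - a)(x - b) \le 0$ for $a \le x \le b$.
Thus,

\[ E(f) = -\frac{(b-a)^3}{12} f''(\xi). \]

Hence,

\[ \int_a^b f(x)\,dx = \frac{b-a}{2}\big[ f(a) + f(b) \big] - \frac{(b-a)^3}{12} f''(\xi). \]


Example 6.8

Consider $m = 0$, $x_0 = (a+b)/2$, and open Newton-Cotes. Using the Lagrange
polynomial procedure, the error can be shown to be proportional to $H^2$ for
this rule. However, a better result can be obtained by expanding $f(x)$ in a
Taylor series about $x_0$:

\[
\int_a^b f(x)\,dx = \int_a^b \left[ f\!\left(\tfrac{a+b}{2}\right) + f'\!\left(\tfrac{a+b}{2}\right)\left(x - \tfrac{a+b}{2}\right) + \frac{f''(\xi(x))}{2}\left(x - \tfrac{a+b}{2}\right)^2 \right] dx
= Q(f) + \frac{(b-a)^3}{24} f''(\xi^*)
\]

for some $\xi^*$, $a < \xi^* < b$, where

\[ Q(f) = (b-a) f\!\left(\frac{a+b}{2}\right) \quad \text{(the midpoint rule).} \tag{6.28} \]

Here the linear term integrates to zero by symmetry about $x_0$. Therefore, for this rule,

\[ E(f) = \frac{1}{24} H^3 f''(\xi^*). \]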


REMARK 6.4 As indicated in Example 6.8 above, the error analysis
starting on page 339 can be improved for certain Newton-Cotes formulas.
Indeed, the following results can be derived:


(i) For closed Newton-Cotes formulas:

(a) ($m$ even)

\[ E(f) = \frac{H^{m+3} f^{(m+2)}(\xi)}{m^{m+3}(m+2)!} \int_0^m t^2 (t-1) \cdots (t-m)\,dt. \]

(b) ($m$ odd)

\[ E(f) = \frac{H^{m+2} f^{(m+1)}(\xi)}{m^{m+2}(m+1)!} \int_0^m t (t-1) \cdots (t-m)\,dt. \]

(ii) For open Newton-Cotes formulas:

(a) ($m$ even)

\[ E(f) = \frac{H^{m+3} f^{(m+2)}(\xi)}{(m+2)^{m+3}(m+2)!} \int_{-1}^{m+1} t^2 (t-1) \cdots (t-m)\,dt. \]

(b) ($m$ odd)

\[ E(f) = \frac{H^{m+2} f^{(m+1)}(\xi)}{(m+2)^{m+2}(m+1)!} \int_{-1}^{m+1} t (t-1) \cdots (t-m)\,dt, \]

for some $\xi$, $a < \xi < b$, in all the above formulas.


REMARK 6.5 The degree of precision of a quadrature formula is defined
as the nonnegative integer $n$ satisfying $E(p_k) = 0$ for $k = 0, 1, \dots, n$, where $p_k$
is any polynomial of degree $k$, but $E(p_{n+1}) \ne 0$ for some polynomial of degree
$(n + 1)$. By the previous remark, the degree of precision of the Newton-Cotes
formulas is $(m + 1)$ if $m$ is even and $m$ if $m$ is odd.


REMARK 6.6 Some common open and closed Newton-Cotes formulas
are the following:

(i) Closed Newton-Cotes

$m = 1$: (trapezoidal rule)

\[ \int_a^b f(x)\,dx = \frac{b-a}{2}\big[ f(a) + f(b) \big] - \frac{H^3}{12} f''(\xi). \]

$m = 2$: (Simpson's rule)

\[ \int_a^b f(x)\,dx = \frac{b-a}{6}\left[ f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b) \right] - \frac{H^5}{2880} f^{(4)}(\xi). \]

(ii) Open Newton-Cotes

$m = 0$: (midpoint rule)

\[ \int_a^b f(x)\,dx = (b-a) f\!\left(\frac{a+b}{2}\right) + \frac{H^3}{24} f''(\xi). \]

$m = 1$: (two-point open rule)

\[ \int_a^b f(x)\,dx = \frac{b-a}{2}\left[ f\!\left(a + \frac{b-a}{3}\right) + f\!\left(a + \frac{2(b-a)}{3}\right) \right] + \frac{H^3}{36} f''(\xi). \]


REMARK 6.7 It is possible to find $f \in C^\infty[a, b]$ such that $E(f) = J(f) - Q(f)$
does not go to zero as $m \to \infty$. (See [22].) Thus, composite methods
are required, in general, if Newton-Cotes formulas are used. (You will show
that $E(f) \to 0$ as $h \to 0$ for composite quadrature even for $f \in C[a, b]$; see
Exercise 11 on page 377.)
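
The low-order rules of Remark 6.6 and the notion of degree of precision from Remark 6.5 can be checked numerically. The following is a sketch only: the function names and the tolerance `tol` are illustrative assumptions, and exactness is tested on the monomials $1, x, \dots$ over $[0,1]$, which suffices by linearity.

```python
def trapezoid(f, a, b):
    return (b - a) * (f(a) + f(b)) / 2


def simpson(f, a, b):
    return (b - a) * (f(a) + 4 * f((a + b) / 2) + f(b)) / 6


def midpoint(f, a, b):
    return (b - a) * f((a + b) / 2)


def two_point_open(f, a, b):
    h = (b - a) / 3
    return (b - a) * (f(a + h) + f(a + 2 * h)) / 2


def degree_of_precision(rule, max_degree=10, tol=1e-12):
    """Largest n such that the rule integrates 1, x, ..., x^n exactly on [0, 1]."""
    n = -1
    for k in range(max_degree + 1):
        exact = 1.0 / (k + 1)                     # integral of x^k on [0, 1]
        if abs(rule(lambda x: x ** k, 0.0, 1.0) - exact) > tol:
            break
        n = k
    return n


degrees = {r.__name__: degree_of_precision(r)
           for r in (trapezoid, simpson, midpoint, two_point_open)}
```

The computed degrees (1 for the trapezoidal, midpoint, and two-point open rules; 3 for Simpson's rule) agree with the statement that the degree of precision is $m + 1$ for even $m$ and $m$ for odd $m$.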


6.3.3 Gaussian Quadrature 


We have just seen that Newton-Cotes formulas can be derived by


(a) choosing the sample (or nodal) points $x_i$, $0 \le i \le m$, equidistant on
$[a, b]$, and


(b) choosing the weights $\alpha_i$, $0 \le i \le m$, so that numerical quadrature is
exact for the highest degree polynomial possible.


(We saw that, using $(m + 1)$ points, the degree of precision for Newton-Cotes
formulas is $m$ if $m$ is odd and $(m + 1)$ if $m$ is even.)


In Gaussian quadrature, the points and weights $x_i$, $\alpha_i$, $0 \le i \le m$, are
both chosen so that the quadrature formula is exact for the highest degree
polynomial possible. This results in the degree of precision for $(m + 1)$-point
Gaussian quadrature being $2m + 1$. Consider the following example.


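Example 6.9


Take $J(f) = \int_a^b f(x)\,dx$ and $m = 1$. By a change of variables, we convert to
the interval $[-1, 1]$ as follows:

\[ J(f) = \int_a^b f(x)\,dx \tag{6.29} \]

\[ \phantom{J(f)} = \int_{-1}^{1} f\!\left( \frac{z(b-a) + b + a}{2} \right) \left( \frac{b-a}{2} \right) dz \tag{6.30} \]

\[ \phantom{J(f)} = \int_{-1}^{1} g(z)\,dz = J(g) \quad \left( \text{where } x = \frac{z(b-a) + b + a}{2} \right). \]

We want to find $\alpha_0$, $\alpha_1$, $z_0$, and $z_1$ such that $Q(g) = \alpha_0 g(z_0) + \alpha_1 g(z_1) = J(g)$ for the highest
degree polynomial possible. Letting $g(z) = 1$, $g(z) = z$, $g(z) = z^2$, and
$g(z) = z^3$, we obtain the following nonlinear system:

\[
\begin{aligned}
g(z) = 1:   &\quad \int_{-1}^{1} 1\,dz = 2 = \alpha_0 + \alpha_1, \\
g(z) = z:   &\quad \int_{-1}^{1} z\,dz = 0 = \alpha_0 z_0 + \alpha_1 z_1, \\
g(z) = z^2: &\quad \int_{-1}^{1} z^2\,dz = \tfrac23 = \alpha_0 z_0^2 + \alpha_1 z_1^2, \\
g(z) = z^3: &\quad \int_{-1}^{1} z^3\,dz = 0 = \alpha_0 z_0^3 + \alpha_1 z_1^3.
\end{aligned}
\]

Solving, we obtain $\alpha_0 = \alpha_1 = 1$, $z_0 = -1/\sqrt{3}$, $z_1 = 1/\sqrt{3}$, which are the
2-point Gaussian points and weights. Hence,

\[
\int_a^b f(x)\,dx = \int_{-1}^{1} g(z)\,dz \approx g\!\left(-\tfrac{1}{\sqrt3}\right) + g\!\left(\tfrac{1}{\sqrt3}\right)
= \frac{b-a}{2}\left[ f\!\left( \frac{(b-a)\left(-\tfrac{1}{\sqrt3}\right) + b + a}{2} \right) + f\!\left( \frac{(b-a)\left(\tfrac{1}{\sqrt3}\right) + b + a}{2} \right) \right].
\]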
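
The 2-point rule of Example 6.9 translates directly into code. The following is a minimal Python sketch; the function name `gauss2` is an illustrative choice.

```python
import math


def gauss2(f, a, b):
    """Two-point Gauss rule on [a, b] with z = -1/sqrt(3), +1/sqrt(3)."""
    z = 1.0 / math.sqrt(3.0)
    half = (b - a) / 2           # Jacobian (b - a)/2 of the map to [-1, 1]
    mid = (b + a) / 2
    # x = (z (b - a) + b + a) / 2 = mid + half * z
    return half * (f(mid - half * z) + f(mid + half * z))


# Exact (up to rounding) for cubics: the integral of x^3 - 2x + 1 over [0, 2] is 2.
cubic_value = gauss2(lambda x: x ** 3 - 2 * x + 1, 0.0, 2.0)
# Not exact for exp, but already accurate with only two function evaluations.
exp_error = (math.e - 1.0) - gauss2(math.exp, 0.0, 1.0)
```

With two function evaluations the rule integrates cubics exactly, whereas the 2-point closed Newton-Cotes (trapezoidal) rule is exact only for linear polynomials.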


We now consider more sophisticated ways to determine these points and
weights, as well as error estimates. Let

\[ J(f) = \int_a^b f(x) \rho(x)\,dx, \tag{6.31} \]

where $\rho(x)$ is a real, positive, piecewise continuous weight function on $(a,b)$,
e.g. $\rho(x) = x$ on $(0,1)$.

REMARK 6.8 We are generalizing the problem from $\rho(x) = 1$.

We consider quadrature formulas of the type

\[ Q(f) = (b - a) \sum_{k=0}^{m} \alpha_k f(x_k) \tag{6.32} \]

with sample points $x_k$, $0 \le k \le m$, and weights $\alpha_k$, $0 \le k \le m$.


DEFINITION 6.2 We call $Q(f)$ a Gaussian quadrature formula if

\[ \int_a^b x^j \rho(x)\,dx = (b - a) \sum_{k=0}^{m} \alpha_k x_k^j, \]

for $j = 0, 1, 2, \dots, 2m + 1$. That is, the quadrature formula has degree of
precision $2m + 1$, i.e., it is exact for polynomials of degree $\le 2m + 1$.

To find the points and weights, we introduce the inner product

\[ (f, g) = \int_a^b \rho(x) f(x) g(x)\,dx. \]


DEFINITION 6.3 The set of polynomials $\{p_i(x)\}_{i=0}^{\infty}$ is called orthogonal
on $[a,b]$ with respect to the weight function $\rho(x)$ if

\[ \int_a^b \rho(x) p_i(x) p_j(x)\,dx = 0 \]

whenever $i \ne j$, i.e., if $(p_i, p_j) = 0$ for $i \ne j$.


Recall the Gram-Schmidt orthogonalization process to find the set $\{p_i\}$
of orthogonal polynomials: Let

\[ p_0(x) = 1, \quad\text{and}\quad p_j(x) = x^j - \sum_{k=0}^{j-1} \frac{(x^j, p_k)}{\|p_k\|^2}\, p_k(x) \quad\text{for } j = 1, 2, \dots, \tag{6.33} \]

where $\|p_k\|^2 = (p_k, p_k)$.

REMARK 6.9 $p_j(x)$ is of degree $j$.
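
The Gram-Schmidt recursion (6.33) can be carried out exactly for polynomial weight functions. The sketch below assumes $\rho(x) = 1$ on $[-1, 1]$ (the Legendre case) and represents a polynomial by its list of coefficients, lowest degree first; the helper names are hypothetical.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj
    return r


def inner(p, q, a=-1, b=1):
    """(p, q) = integral of p(x) q(x) over [a, b], assuming rho(x) = 1."""
    r = poly_mul(p, q)
    return sum(c * (Fraction(b) ** (k + 1) - Fraction(a) ** (k + 1)) / (k + 1)
               for k, c in enumerate(r))


def orthogonal_polys(n, a=-1, b=1):
    """Return p_0, ..., p_n generated by (6.33)."""
    polys = [[Fraction(1)]]
    for j in range(1, n + 1):
        xj = [Fraction(0)] * j + [Fraction(1)]        # the monomial x^j
        pj = list(xj)
        for pk in polys:
            coef = inner(xj, pk, a, b) / inner(pk, pk, a, b)
            for i, c in enumerate(pk):
                pj[i] -= coef * c
        polys.append(pj)
    return polys


p = orthogonal_polys(3)
```

For this weight the output is $p_1 = x$, $p_2 = x^2 - 1/3$, and $p_3 = x^3 - \tfrac35 x$, the monic Legendre polynomials; note that the zeros of $p_2$ are $\pm 1/\sqrt{3}$, the points found in Example 6.9.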


We will see that the Gaussian quadrature points are the zeros of $p_{m+1}(x)$.
First, we show that the zeros $\{x_i\}_{i=0}^{m}$ of $p_{m+1}(x)$ are distinct and lie in
$(a, b)$.


PROPOSITION 6.1

Let $\{p_i\}_{i=0}^{\infty}$ be the sequence (6.33) of orthogonal polynomials. If $f$ is any
continuous function on $[a, b]$ that is orthogonal to $p_0, p_1, \dots, p_{k-1}$, then $f$ must
either vanish identically or change sign at least $k$ times in $(a,b)$.


PROOF Since $p_0(x) = 1$, $\int_a^b f(x)\rho(x)\,dx = 0$. Since $\rho(x) > 0$ on $(a,b)$,
if $f(x) \not\equiv 0$, then $f$ must change sign at least once in $(a,b)$. Suppose
that $f$ changes sign fewer than $k$ times, say at $r_1 < r_2 < \cdots < r_s$, $s < k$.
Then, on each interval $(a, r_1), (r_1, r_2), \dots, (r_s, b)$, $f$ does not change sign, but
has opposite signs on adjacent subintervals. Consider $p(x) = (x - r_1)(x - r_2)\cdots(x - r_s)$.
This polynomial $p$ shares with $f$ the property that it changes
sign at $r_i$, $1 \le i \le s$, and has opposite signs on adjacent intervals. Hence,
$\int_a^b f(x)p(x)\rho(x)\,dx \ne 0$. Now $p$ can be written as a linear combination of $p_0,
p_1, \dots, p_s$, $s < k$. But $\int_a^b f(x)\rho(x)p_i(x)\,dx = 0$ for $i = 0, 1, \dots, k-1$. Thus,
$\int_a^b f(x)\rho(x)p(x)\,dx = 0$, a contradiction. Therefore, $f$ must change sign, and hence vanish, at least $k$
times in $(a, b)$.

COROLLARY 6.1
Let $\{p_i\}_{i=0}^{\infty}$ be the sequence of orthogonal polynomials given by (6.33). The
roots of $p_i(x)$ are simple and lie in $(a, b)$.


PROOF The polynomial $p_i$ is orthogonal to $p_0, p_1, \dots, p_{i-1}$. Thus, by
the proposition, we conclude that $p_i(x)$ must vanish at least $i$ times in $(a, b)$,
and hence exactly $i$ times, since $p_i$ is of degree $i$.


REMARK 6.10 By Corollary 6.1, the zeros $\{x_i\}_{i=0}^{m}$ of $p_{m+1}(x)$ are distinct
and lie in $(a,b)$.


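Now consider $\int_a^b \rho(x) f(x)\,dx$. If $f(x)$ is a polynomial of degree $m$, we can
write $f(x)$ as

\[ f(x) = \sum_{j=0}^{m} f(x_j) L_j(x), \quad\text{where}\quad L_j(x) = \prod_{\substack{i=0 \\ i \ne j}}^{m} \frac{x - x_i}{x_j - x_i}. \]

(This is the Lagrange form of the interpolating polynomial for any $m + 1$ distinct
points $x_0, x_1, \dots, x_m$ on $(a,b)$.) Thus, $\int_a^b \rho(x) f(x)\,dx = (b - a) \sum_{j=0}^{m} \alpha_j f(x_j)$
will be exact for polynomials of degree $\le m$ provided

\[ \alpha_j = \frac{1}{b-a} \int_a^b \rho(x) \prod_{\substack{i=0 \\ i \ne j}}^{m} \frac{x - x_i}{x_j - x_i}\,dx = \frac{1}{b-a} \int_a^b \rho(x) L_j(x)\,dx. \tag{6.34} \]

We now show that the Gaussian quadrature points $\{x_j\}_{j=0}^{m}$ are the zeros of
$p_{m+1}(x)$ and the weights $\{\alpha_j\}$ satisfy (6.34).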


THEOREM 6.1

Suppose that the $(m + 1)$-point quadrature formula is exact for polynomials
of degree $\le m$. A quadrature formula is a Gaussian quadrature formula
(exact for polynomials of degree $\le 2m + 1$) if and only if the sample points
$x_0, x_1, \dots, x_m$ are the zeros of the orthogonal polynomial $p_{m+1}(x)$, and the
coefficients $\alpha_0, \alpha_1, \dots, \alpha_m$ can be expressed as in (6.34).


PROOF Let $\{x_i\}_{i=0}^{m}$ be the zeros of $p_{m+1}(x)$. Let $\varphi(x)$ be a polynomial
of degree $\le 2m + 1$. Then, $\varphi(x) = q(x) p_{m+1}(x) + r(x)$ (by dividing $\varphi$ by
$p_{m+1}$), where the degrees of $q$ and $r$ are each $\le m$. Therefore, $(p_{m+1}, q) = 0$,
so

\[ \int_a^b \rho(x) \varphi(x)\,dx = \int_a^b \rho(x) r(x)\,dx = (b - a) \sum_{i=0}^{m} \alpha_i r(x_i), \]

since $r$ is of degree $\le m$. Hence,

\[ \int_a^b \rho(x) \varphi(x)\,dx = (b - a) \sum_{i=0}^{m} \alpha_i \varphi(x_i), \]

since $p_{m+1}(x_i) = 0$ for $i = 0, 1, \dots, m$. Therefore, the integration formula is
exact for $\varphi(x)$.

Conversely, let $Q(f)$ be exact for $f \in P_{2m+1}$. In particular, choose
$f(x) = p_j(x) \prod_{i=0}^{m} (x - x_i)$ for $j \le m$. Then,

\[ \int_a^b \rho(x) f(x)\,dx = (b - a) \sum_{i=0}^{m} \alpha_i f(x_i) = 0. \]

We thus conclude that for $j \le m$,

\[ \int_a^b \rho(x) p_j(x) \prod_{i=0}^{m} (x - x_i)\,dx = 0, \]

i.e.

\[ \Big( \prod_{i=0}^{m} (x - x_i),\; p_j \Big) = 0 \quad\text{for } j = 0, 1, \dots, m. \]

Thus, $\prod_{i=0}^{m} (x - x_i)$ is orthogonal to $p_j(x)$, $0 \le j \le m$. By uniqueness of the
Gram-Schmidt process, $\prod_{i=0}^{m} (x - x_i)$ must be a scalar multiple of $p_{m+1}(x)$.
Hence, $x_i$, $0 \le i \le m$, are the roots of $p_{m+1}(x)$. Finally, since the integration
formula is exact for polynomials of degree $\le m$, (6.34) is satisfied.


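REMARK 6.11 An efficient algebraic procedure for calculating the $\alpha_i$
and $x_i$, $0 \le i \le m$, can be described. Notice that

\[ \int_a^b \rho(x) f(x)\,dx = (b - a) \sum_{i=0}^{m} \alpha_i f(x_i) \]

for $f(x) = 1$, $f(x) = x$, $f(x) = x^2$, ..., $f(x) = x^{2m+1}$, and set

\[ m_j = \frac{1}{b-a} \int_a^b \rho(x) x^j\,dx. \]

Then,

\[
\begin{aligned}
\alpha_0 + \alpha_1 + \cdots + \alpha_m &= m_0, \\
\alpha_0 x_0 + \alpha_1 x_1 + \cdots + \alpha_m x_m &= m_1, \\
\alpha_0 x_0^2 + \alpha_1 x_1^2 + \cdots + \alpha_m x_m^2 &= m_2, \\
&\;\;\vdots \\
\alpha_0 x_0^{2m+1} + \alpha_1 x_1^{2m+1} + \cdots + \alpha_m x_m^{2m+1} &= m_{2m+1}.
\end{aligned}
\tag{6.35}
\]

Define $p(x) = (x - x_0)(x - x_1) \cdots (x - x_m) = \sum_{j=0}^{m+1} c_j x^j$, with $c_{m+1} = 1$. Using
the first $(m + 2)$ equations of (6.35), we have

\[
\begin{aligned}
c_0 (\alpha_0 + \alpha_1 + \cdots + \alpha_m) &= c_0 m_0, \\
c_1 (\alpha_0 x_0 + \alpha_1 x_1 + \cdots + \alpha_m x_m) &= c_1 m_1, \\
&\;\;\vdots \\
c_{m+1} (\alpha_0 x_0^{m+1} + \alpha_1 x_1^{m+1} + \cdots + \alpha_m x_m^{m+1}) &= c_{m+1} m_{m+1}.
\end{aligned}
\tag{6.36}
\]

Hence,

\[
\begin{aligned}
m_0 c_0 + m_1 c_1 + \cdots + m_{m+1} c_{m+1}
&= \alpha_0 (c_0 + c_1 x_0 + \cdots + c_{m+1} x_0^{m+1}) \\
&\quad + \alpha_1 (c_0 + c_1 x_1 + \cdots + c_{m+1} x_1^{m+1}) \\
&\quad + \cdots + \alpha_m (c_0 + c_1 x_m + \cdots + c_{m+1} x_m^{m+1}) \\
&= \alpha_0 p(x_0) + \alpha_1 p(x_1) + \cdots + \alpha_m p(x_m) \\
&= 0, \quad \text{since } p(x_j) = 0 \text{ for } 0 \le j \le m.
\end{aligned}
\]

Thus, $m_0 c_0 + m_1 c_1 + \cdots + m_m c_m = -m_{m+1}$. Considering the second through
$(m + 3)$rd equations in (6.35) in the same manner yields:

\[ m_1 c_0 + m_2 c_1 + \cdots + m_{m+1} c_m = -m_{m+2}. \]

Continuing in this manner, the following system is obtained:

\[
\begin{aligned}
m_0 c_0 + m_1 c_1 + \cdots + m_m c_m &= -m_{m+1}, \\
m_1 c_0 + m_2 c_1 + \cdots + m_{m+1} c_m &= -m_{m+2}, \\
m_2 c_0 + m_3 c_1 + \cdots + m_{m+2} c_m &= -m_{m+3}, \\
&\;\;\vdots \\
m_m c_0 + m_{m+1} c_1 + \cdots + m_{2m} c_m &= -m_{2m+1}.
\end{aligned}
\tag{6.37}
\]

Also, since $c_{m+1} = 1$, the polynomial $p(x)$ is completely determined when
(6.37) is solved for $c_0, c_1, \dots, c_m$. Hence, if the roots $\{x_j\}_{j=0}^{m}$ of $p(x)$, which
are distinct and lie in $(a, b)$, are calculated, then the weights $\alpha_j$, $0 \le j \le m$,
can be determined from (6.35).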
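
To make the procedure of Remark 6.11 concrete, the sketch below carries it out for the smallest nontrivial case $m = 1$, assuming $\rho(x) = 1$. The $2 \times 2$ system (6.37) is solved by Cramer's rule, the roots of the quadratic $p(x) = x^2 + c_1 x + c_0$ are found with the quadratic formula, and the weights come from the first two equations of (6.35). The function name `gauss_two_point` is hypothetical.

```python
import math
from fractions import Fraction


def gauss_two_point(a, b):
    """Points and weights (m = 1, rho = 1) via the moment equations."""
    def moment(j):
        # m_j = 1/(b - a) * integral of x^j over [a, b]
        return (Fraction(b) ** (j + 1) - Fraction(a) ** (j + 1)) / ((j + 1) * (b - a))

    m0, m1, m2, m3 = (moment(j) for j in range(4))
    # (6.37) with m = 1:  m0 c0 + m1 c1 = -m2,   m1 c0 + m2 c1 = -m3
    det = m0 * m2 - m1 * m1
    c0 = (-m2 * m2 + m1 * m3) / det
    c1 = (-m0 * m3 + m1 * m2) / det
    # Roots of p(x) = x^2 + c1 x + c0
    disc = math.sqrt(float(c1 * c1 - 4 * c0))
    x0 = (-float(c1) - disc) / 2
    x1 = (-float(c1) + disc) / 2
    # First two equations of (6.35): alpha0 + alpha1 = m0, alpha0 x0 + alpha1 x1 = m1
    alpha1 = (float(m1) - float(m0) * x0) / (x1 - x0)
    alpha0 = float(m0) - alpha1
    return (x0, x1), (alpha0, alpha1)


points, weights = gauss_two_point(-1, 1)
```

On $[-1, 1]$ this gives the points $\mp 1/\sqrt{3}$ and weights $1/2$ each; with the factor $(b - a) = 2$ in (6.32), these reproduce the rule of Example 6.9.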


6.3.3.1 Examples of Gauss-Quadrature Rules

Suppose that $\rho(x) = 1$ (standard Gauss-Legendre quadrature rules). First,

\[
\int_a^b f(x)\,dx = \frac{b-a}{2} \int_{-1}^{1} f\!\left( \frac{z(b-a) + a + b}{2} \right) dz = \int_{-1}^{1} g(z)\,dz
\approx 2 \sum_{j=0}^{m} \alpha_j g(z_j),
\]

where $g(z) = \frac{b-a}{2} f\big( (z(b-a) + a + b)/2 \big)$. Thus, we need only consider the
interval $[-1, 1]$, since if $\alpha_j$ and $z_j$, $0 \le j \le m$, are the weights and points
for $[-1, 1]$, then $\alpha_j$ and $(z_j(b-a) + a + b)/2$, $0 \le j \le m$, are the weights
and points for $[a, b]$. The weights and points for the first few Gauss-Legendre
quadrature rules, where the $\alpha_i$ do not include the factor of 2 in the preceding
formula, appear in Table 6.1.


TABLE 6.1: Weights and sample points: Gauss-Legendre quadrature
1 point (m = 0)
2 point (m = 1)
3 point (m = 2)
4 point (m = 3)

6.3.3.2 Error Formula for Gaussian Quadrature 


To derive an error formula for Gaussian quadrature, we start with the 
following lemma. 


LEMMA 6.1
Let $x_0, x_1, \dots, x_m$ be distinct points in $[a,b]$ and let $f \in C^{2m+2}[a, b]$. If $p$ is
the Hermite interpolating polynomial of degree at most $2m + 1$ such that

\[ p(x_i) = f(x_i), \quad p'(x_i) = f'(x_i) \quad\text{for } i = 0, 1, \dots, m, \]

then

\[ f(x) - p(x) = \frac{f^{(2m+2)}(\xi(x))}{(2m+2)!} \prod_{i=0}^{m} (x - x_i)^2, \quad\text{where } \xi(x) \in (a,b). \]


PROOF Lemma 6.1 is a restatement of Theorem 4.8, the proof of which
appears on page 221.


We now have 


THEOREM 6.2
Let $\rho(x) > 0$ be a weight function defined on $[a,b]$ and let $\{p_k(x)\}_{k=0}^{\infty}$ be
the associated sequence of orthogonal polynomials generated by (6.33). Let
$\{x_i, \alpha_i\}_{i=0}^{m}$ be the points and weights for Gauss quadrature. Then, if
$f \in C^{2m+2}[a, b]$, we have

\[ \int_a^b \rho(x) f(x)\,dx - (b - a) \sum_{j=0}^{m} \alpha_j f(x_j) = \frac{f^{(2m+2)}(\xi)}{(2m+2)!} \int_a^b \rho(x) p_{m+1}^2(x)\,dx, \tag{6.38} \]

where $\xi \in (a,b)$.


PROOF Let $h_{2m+1}(x)$ be the Hermite interpolating polynomial of degree
at most $2m + 1$ that interpolates the function values and slopes of $f(x)$ at
$x_j$, $0 \le j \le m$, i.e., $f(x_j) = h_{2m+1}(x_j)$, $f'(x_j) = h'_{2m+1}(x_j)$, $0 \le j \le m$. By
Lemma 6.1,

\[ f(x) - h_{2m+1}(x) = \frac{f^{(2m+2)}(\xi(x))}{(2m+2)!} \prod_{i=0}^{m} (x - x_i)^2 \]

for some $\xi(x) \in (a,b)$. Multiplying both sides by $\rho(x)$ and integrating,

\[ \int_a^b \big( f(x) - h_{2m+1}(x) \big) \rho(x)\,dx = \int_a^b \frac{f^{(2m+2)}(\xi(x))}{(2m+2)!} \rho(x) p_{m+1}^2(x)\,dx, \]

since $p_{m+1}^2(x) = (x - x_0)^2 (x - x_1)^2 \cdots (x - x_m)^2$ and the leading coefficient
of $p_{m+1}(x)$ is 1. Thus,

\[ \int_a^b f(x) \rho(x)\,dx - \int_a^b h_{2m+1}(x) \rho(x)\,dx = \frac{f^{(2m+2)}(\xi)}{(2m+2)!} \int_a^b \rho(x) p_{m+1}^2(x)\,dx \]

by the mean value theorem for integrals. Now,

\[ \int_a^b \rho(x) h_{2m+1}(x)\,dx = (b - a) \sum_{j=0}^{m} \alpha_j h_{2m+1}(x_j) = (b - a) \sum_{j=0}^{m} \alpha_j f(x_j) \]

(since the formula is exact for polynomials of degree at most $2m + 1$). Thus,

\[ E(f) = J(f) - Q(f) = \frac{f^{(2m+2)}(\xi)}{(2m+2)!} \int_a^b \rho(x) p_{m+1}^2(x)\,dx. \]


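REMARK 6.12 Consider

\[ \int_a^b \rho(x) p_{m+1}^2(x)\,dx = \int_a^b \rho(x) \prod_{i=0}^{m} (x - x_i)^2\,dx. \]

Let

\[ z = \frac{x - a}{b - a}, \qquad z_i = \frac{x_i - a}{b - a}. \]

Then,

\[ \int_a^b \rho(x) p_{m+1}^2(x)\,dx = (b - a)^{2m+3} \int_0^1 \rho\big(z(b - a) + a\big) \prod_{i=0}^{m} (z - z_i)^2\,dz. \]

Hence,

\[ E(f) = \frac{f^{(2m+2)}(\xi)}{(2m+2)!} H^{2m+3} \hat{B}_m, \]

where

\[ H = (b - a) \quad\text{and}\quad \hat{B}_m = \int_0^1 \rho\big(z(b - a) + a\big) \prod_{i=0}^{m} (z - z_i)^2\,dz. \]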
2 i=0 


REMARK 6.13 By Theorem 6.2, the degree of precision of (m + 1)- 
point Gauss quadrature is 2m+ 1. For example, 4-point Gauss quadrature 
integrates polynomials of degree less than or equal to 7 exactly. 


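REMARK 6.14 Suppose that $\rho(x) = 1$. For composite Gauss quadrature,

\[
\int_a^b f(x)\,dx = \sum_{\nu=0}^{N-1} \int_{-1}^{1} \frac{h}{2} f\!\left( \frac{z h + (2\nu + 1)h + 2a}{2} \right) dz
\approx h \sum_{\nu=0}^{N-1} \sum_{j=0}^{m} \alpha_j f\!\left( \frac{z_j h + (2\nu + 1)h + 2a}{2} \right),
\]

where $\sum_{j=0}^{m} \alpha_j = 1$ and $\alpha_j$, $z_j$ are Gauss quadrature weights and points for
$[-1, 1]$ and $h = (b - a)/N$. Furthermore, from the derivation of composite quadrature
(page 335) and Theorem 6.2, the error in composite $(m + 1)$-point Gauss
quadrature is proportional to $h^{2m+2}$ for $f \in C^{2m+2}[a, b]$.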


There is one additional aspect of Gauss quadrature which is important.
That is, $\alpha_j > 0$ for $j = 0, 1, 2, \dots, m$ for any $m$. This implies, since
the formula is exact for $f(x) = 1$, that

\[ \sum_{j=0}^{m} |\alpha_j| = \sum_{j=0}^{m} \alpha_j = \frac{1}{b-a} \int_a^b \rho(x)\,dx, \]

which is constant for all $m$. This makes Gauss quadrature stable in the presence
of rounding errors, in contrast to Newton-Cotes methods. In addition,
as you show in Exercise 12 on page 378, $Q_m(f) \to J(f)$ as $m \to \infty$ for any
$f \in C[a, b]$, where $Q_m(f)$ represents $m$-point Gaussian quadrature with weight
function $\rho$. Hence, a composite procedure is not necessary for convergence of
Gauss quadrature for $f \in C[a,b]$. In fact, in some applications, 64-point or
higher Gauss quadrature rules are used. Nonetheless, since accurate values
of weights and sample points may be difficult to obtain for very high-order
Gaussian quadrature, composite Gauss quadrature is useful. In contrast, as
mentioned earlier, $E(f)$ may not go to zero for Newton-Cotes formulas (footnote 6) as
$m \to \infty$, even for $f \in C^\infty[a,b]$. Thus, composite quadrature is essential for
Newton-Cotes rules. (Recall that, in Exercise 11 on page 377, you show that
$E(f) \to 0$ as $h \to 0$ for general composite quadrature for $f \in C[a, b]$, be it
Gaussian or Newton-Cotes.)


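THEOREM 6.3
The weights $\alpha_j$, $j = 0, 1, \dots, m$, in Gaussian quadrature are all positive for
any $m$.


PROOF Recall that $(m + 1)$-point Gauss quadrature is exact for polynomials
of degree less than or equal to $2m + 1$. Let

\[ q_j(x) = \frac{p_{m+1}^2(x)}{(x - x_j)^2} = \prod_{\substack{i=0 \\ i \ne j}}^{m} (x - x_i)^2. \]

Since $q_j \in P_{2m}$,

\[ \int_a^b \rho(x) q_j(x)\,dx = \sum_{i=0}^{m} \alpha_i q_j(x_i)(b - a) = (b - a)\alpha_j \prod_{\substack{i=0 \\ i \ne j}}^{m} (x_j - x_i)^2 = (b - a)\alpha_j \big( p'_{m+1}(x_j) \big)^2, \]

since $q_j(x_i) = 0$ if $i \ne j$. Thus,

\[ \alpha_j = \frac{\displaystyle\int_a^b \rho(x) p_{m+1}^2(x) / (x - x_j)^2\,dx}{\big( p'_{m+1}(x_j) \big)^2 (b - a)} > 0 \]

for each $j$.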

Footnote 6: which have the advantage of uniformly spaced sample points.


6.3.4 Romberg Integration


Romberg integration is an efficient and popular numerical integration procedure.
Let

\[ J(f) = \int_a^b f(x)\,dx, \]

and let $S_h(f)$ be some composite numerical integration rule using interval
width $h$. For example, using the notation of section 6.3.1 (starting on page 333),

\[ S_h(f) = \sum_{\nu=0}^{N-1} \sum_{j=0}^{m} h\,\alpha_j f\!\left( \frac{x_j h + (2\nu + 1)h + 2a}{2} \right), \]

where $h = (b - a)/N$ and $x_j \in [-1, 1]$ for $0 \le j \le m$.

Let $E_h(f) = J(f) - S_h(f)$. Now, suppose that

\[ E_h(f) = c_1 h + c_2 h^2 + \cdots + c_{2n-2} h^{2n-2} + O(h^{2n-1}), \tag{6.39} \]

where the $c_i$, $i = 1, 2, \dots, 2n - 2$, are constants independent of $h$. Consider
$J(f) = S_h(f) + c_1 h + c_2 h^2 + \cdots + c_{2n-2} h^{2n-2} + O(h^{2n-1})$, so

\[ J(f) = S_{h/2}(f) + c_1 \left( \frac{h}{2} \right) + c_2 \left( \frac{h}{2} \right)^2 + \cdots, \]

where $S_{h/2}$ is the composite rule approximation using interval width $h/2$.
Thus,

\[ 2J(f) - J(f) = 2S_{h/2}(f) - S_h(f) + \hat{c}_2 h^2 + \hat{c}_3 h^3 + \cdots. \]

Set $S_h^{(1)}(f) = 2S_{h/2}(f) - S_h(f)$. Then,

\[ J(f) = S_h^{(1)}(f) + \hat{c}_2 h^2 + \hat{c}_3 h^3 + \hat{c}_4 h^4 + \cdots, \quad\text{and} \]

\[ J(f) = S_{h/2}^{(1)}(f) + \hat{c}_2 \left( \frac{h}{2} \right)^2 + \hat{c}_3 \left( \frac{h}{2} \right)^3 + \cdots. \]

Thus,

\[ J(f) = \frac{4 S_{h/2}^{(1)}(f) - S_h^{(1)}(f)}{3} + \hat{\hat{c}}_3 h^3 + \cdots. \]

Set

\[ S_h^{(2)}(f) = \big( 4 S_{h/2}^{(1)}(f) - S_h^{(1)}(f) \big) / 3. \]

Continuing this procedure, set

\[ S_h^{(k+1)}(f) = \frac{2^{k+1} S_{h/2}^{(k)}(f) - S_h^{(k)}(f)}{2^{k+1} - 1}. \tag{6.40} \]

We can write these approximations as in Table 6.2. In this table, the quantities
in a particular column are computed from the two quantities in the preceding
column immediately to the left and above, using (6.40).




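TABLE 6.2: Illustration of Richardson extrapolation

  O(h)                         O(h^2)           O(h^3)           O(h^4)
  $S_h^{(0)} = S_h(f)$
  $S_{h/2}^{(0)} = S_{h/2}(f)$   $S_h^{(1)}$
  $S_{h/4}^{(0)} = S_{h/4}(f)$   $S_{h/2}^{(1)}$   $S_h^{(2)}$
  $S_{h/8}^{(0)} = S_{h/8}(f)$   $S_{h/4}^{(1)}$   $S_{h/2}^{(2)}$   $S_h^{(3)}$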

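If the error expansion is of even order, which is preferable, i.e.,

\[ E_h(f) = c_2 h^2 + c_4 h^4 + \cdots + c_{2n-2} h^{2n-2} + O(h^{2n}), \tag{6.41} \]

then

\[ S_h^{(1)}(f) = (4 S_{h/2} - S_h)/3, \]

\[ S_h^{(2)}(f) = \big( 16 S_{h/2}^{(1)} - S_h^{(1)} \big)/15, \]

which, in general, is

\[ S_h^{(k+1)}(f) = \frac{4^{k+1} S_{h/2}^{(k)}(f) - S_h^{(k)}(f)}{4^{k+1} - 1}. \tag{6.42} \]

The computations associated with (6.42) appear in Table 6.3.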


TABLE 6.3: Illustration of even-order 
Richardson extrapolation 


REMARK 6.15 The above process is called Richardson extrapolation,
and can be performed whenever the error expansion has the form (6.39) or
(6.41). Richardson extrapolation is useful in numerical integration, numerical
solution of integral equations and initial-value problems, numerical methods
for solving partial differential equations, and stochastic differential equations.


REMARK 6.16 When Richardson extrapolation is applied to the composite
trapezoidal rule, the numerical integration procedure is called Romberg
integration. (However, as noted later, this extrapolation process can be applied
to all standard composite integration formulas.)


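We now consider the composite trapezoidal rule in detail, i.e.,

\[
\begin{aligned}
S_h(f) &= h \left( \tfrac12 f(a) + \tfrac12 f(b) \right) + h \sum_{\nu=1}^{N-1} f(a + \nu h) \\
       &= \frac{h}{2} \sum_{\nu=0}^{N-1} \big[ f(a + \nu h) + f(a + \nu h + h) \big], \quad\text{where } h = \frac{b - a}{N}.
\end{aligned}
\tag{6.43}
\]

We need to show that

\[ E_h(f) = c_2 h^2 + c_4 h^4 + c_6 h^6 + \cdots, \]

where the constants $c_i$, $i = 2, 4, \dots$, do not depend on $h$. To establish this
result, we will use the following lemma.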


LEMMA 6.2
(Euler-Maclaurin formula) For $f \in C^{2m+2}[a, b]$,

\[
\int_a^b f(x)\,dx = \frac{H}{2}\big[ f(a) + f(b) \big] - \sum_{k=1}^{m} \frac{B_{2k}}{(2k)!} H^{2k} \big[ f^{(2k-1)}(b) - f^{(2k-1)}(a) \big]
- \frac{B_{2m+2}}{(2m+2)!} H^{2m+3} f^{(2m+2)}(\xi),
\]

where $H = b - a$ and $\xi$ is some point in $[a,b]$. The numbers $B_2, B_4, B_6, \dots$
are the Bernoulli numbers of even order, which can be defined by

\[ B_0 = 1, \qquad \sum_{k=0}^{n} \binom{n+1}{k} B_k = 0 \quad\text{for } n \ge 1. \]

PROOF Let $P_n(t)$, $n = 0, 1, 2, \dots$, be the Bernoulli polynomials defined
by $P_0(t) = 1$ and

\[ \sum_{k=0}^{n} \binom{n+1}{k} P_k(t) = (n + 1) t^n, \tag{6.44} \]

i.e.,

\[ (n + 1) P_n(t) + \sum_{k=0}^{n-1} \binom{n+1}{k} P_k(t) = (n + 1) t^n. \]

Thus,

\[
\begin{aligned}
P_0(t) &= 1, \\
P_1(t) &= t - \tfrac12, \\
P_2(t) &= t^2 - t + \tfrac16, \\
P_3(t) &= t^3 - \tfrac32 t^2 + \tfrac12 t.
\end{aligned}
\]

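We will establish first the following properties of Bernoulli polynomials:

\[ P_n'(t) = n P_{n-1}(t), \quad n \ge 1. \tag{6.45} \]

\[ P_n(t + 1) - P_n(t) = n t^{n-1}. \tag{6.46} \]

\[ P_n(t) = \sum_{k=0}^{n} \binom{n}{k} P_k(0)\, t^{n-k}. \tag{6.47} \]

\[ P_n(1 - t) = (-1)^n P_n(t). \tag{6.48} \]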


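Proof of (6.45): We establish (6.45) by induction. Clearly (6.45) is valid for
$n = 1$. Now, suppose $P_k'(t) = k P_{k-1}(t)$ for $k = 1, 2, \dots, n - 1$. Differentiating
(6.44) and using the induction hypothesis yields

\[ (n + 1) P_n'(t) + \sum_{k=1}^{n-1} \binom{n+1}{k} k P_{k-1}(t) = (n + 1) n t^{n-1} = (n + 1) \sum_{k=0}^{n-1} \binom{n}{k} P_k(t), \]

where the last equality is (6.44) with $n$ replaced by $n - 1$. Dividing by $(n + 1)$ gives

\[ P_n'(t) = \sum_{k=0}^{n-1} \binom{n}{k} P_k(t) - \sum_{k=1}^{n-1} \binom{n}{k-1} P_{k-1}(t), \]

since

\[ \frac{k}{n+1} \binom{n+1}{k} = \binom{n}{k-1}. \]

Hence,

\[ P_n'(t) = n P_{n-1}(t) + \sum_{k=0}^{n-2} \binom{n}{k} P_k(t) - \sum_{k=1}^{n-1} \binom{n}{k-1} P_{k-1}(t) = n P_{n-1}(t). \]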


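Proof of (6.46): By (6.45),

\[
\begin{aligned}
P_n''(t) &= n P_{n-1}'(t) = n(n - 1) P_{n-2}(t), \\
P_n'''(t) &= n(n - 1)(n - 2) P_{n-3}(t), \\
&\;\;\vdots \\
P_n^{(k)}(t) &= n(n - 1) \cdots (n - k + 1) P_{n-k}(t).
\end{aligned}
\]

Thus, the Taylor series for $P_n(t)$ is

\[ P_n(t + h) = \sum_{k=0}^{n} \frac{h^k}{k!} P_n^{(k)}(t) = \sum_{k=0}^{n} \binom{n}{k} P_{n-k}(t) h^k. \tag{6.51} \]

Now, set $h = 1$ and use $\binom{n}{k} = \binom{n}{n-k}$ and (6.44) to obtain

\[
\begin{aligned}
P_n(t + 1) &= \sum_{k=0}^{n} \binom{n}{k} P_{n-k}(t) \\
&= \sum_{k=0}^{n} \binom{n}{n-k} P_{n-k}(t) \\
&= P_n(t) + \sum_{k=1}^{n} \binom{n}{n-k} P_{n-k}(t) \\
&= P_n(t) + \sum_{j=0}^{n-1} \binom{n}{j} P_j(t) \quad (\text{letting } j = n - k) \\
&= P_n(t) + n t^{n-1}.
\end{aligned}
\]

Hence, (6.46) is established.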
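Proof of (6.47): Set $h = t$ and $t = 0$ in (6.51) to obtain

\[
\begin{aligned}
P_n(t) &= \sum_{k=0}^{n} \binom{n}{k} P_{n-k}(0)\, t^k \\
&= \sum_{j=0}^{n} \binom{n}{n-j} P_j(0)\, t^{n-j} \quad (\text{setting } j = n - k) \\
&= \sum_{j=0}^{n} \binom{n}{j} P_j(0)\, t^{n-j}.
\end{aligned}
\]

Thus, $P_n(t) = \sum_{k=0}^{n} \binom{n}{k} P_k(0)\, t^{n-k}$.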


Proof of (6.48): (6.46) with $t$ replaced with $(-t)$ gives

\[ P_n(-t + 1) - P_n(-t) = n(-1)^{n-1} t^{n-1} = (-1)^{n-1} \big( P_n(t + 1) - P_n(t) \big). \]

Writing this as

\[ (-1)^n P_n(t + 1) - P_n(-t) = (-1)^n P_n(t) - P_n(1 - t) = F(t), \]

we see that $F(t + 1) = F(t)$ for all $t$. Thus, $F$ is periodic with period 1. But
$F$ is a polynomial, so $F$ must be constant. Therefore,

\[ -P_n(1 - t) + (-1)^n P_n(t) = c_n. \]

Differentiating this expression and using (6.45) gives

\[ (-1)^n P_n'(t) + P_n'(1 - t) = (-1)^n n P_{n-1}(t) + n P_{n-1}(1 - t) = 0. \]

Since $n \ge 1$ is arbitrary, replacing $n - 1$ by $n$ gives $P_n(1 - t) = (-1)^n P_n(t)$.


REMARK 6.17 Setting $t = 0$ in (6.46) and (6.48) gives

\[ P_n(1) = P_n(0) = (-1)^n P_n(0) \]

for $n > 1$. Thus, $0 = P_3(0) = P_5(0) = P_7(0) = \cdots$.


REMARK 6.18 $P_n(0) = P_n(1) = B_n$ (the $n$-th Bernoulli number) for $n > 1$. Also,
for $n$ odd, $n \ge 3$, $B_n = 0$.


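We now complete the proof of Lemma 6.2. Consider

\[ \int_a^b f(x)\,dx = \int_0^1 g(z)\,dz, \quad\text{where } g(z) = (b - a) f\big( a + (b - a) z \big). \]

We will show that

\[ \int_0^1 g(z)\,dz = \frac12 \big[ g(0) + g(1) \big] - \sum_{k=1}^{m} \frac{B_{2k}}{(2k)!} \big[ g^{(2k-1)}(1) - g^{(2k-1)}(0) \big] - \frac{B_{2m+2}}{(2m+2)!} g^{(2m+2)}(\xi) \]

for some $0 < \xi < 1$; it is easy to see that Lemma 6.2 follows immediately from
this. Consider

\[ \int_0^1 g(z)\,dz = \int_0^1 P_0(z) g(z)\,dz = \big[ P_1(z) g(z) \big]_0^1 - \int_0^1 P_1(z) g'(z)\,dz = \frac12 \big( g(0) + g(1) \big) - \int_0^1 P_1(z) g'(z)\,dz, \]

since $P_n(z) = P_{n+1}'(z)/(n + 1)$ and $P_1(1) = -P_1(0) = 1/2$. Performing another
integration by parts yields

\[ \int_0^1 g(z)\,dz = \frac12 \big( g(0) + g(1) \big) - \frac{B_2}{2} \big( g'(1) - g'(0) \big) + \frac12 \int_0^1 P_2(z) g''(z)\,dz. \]

Continuing this procedure and using $B_{2n+1} = 0$ for $n = 1, 2, \dots$,

\[ \int_0^1 g(z)\,dz = \frac12 \big( g(0) + g(1) \big) - \sum_{k=1}^{m+1} \frac{B_{2k}}{(2k)!} \big[ g^{(2k-1)}(1) - g^{(2k-1)}(0) \big] + \frac{1}{(2m+2)!} \int_0^1 P_{2m+2}(z) g^{(2m+2)}(z)\,dz. \]

The last term in the above sum can be expressed as

\[ \frac{B_{2m+2}}{(2m+2)!} \Big[ g^{(2m+1)}(z) \Big]_0^1 = \frac{B_{2m+2}}{(2m+2)!} \int_0^1 g^{(2m+2)}(z)\,dz. \]

Thus, the last term in the sum and the integral give

\[
\begin{aligned}
\frac{1}{(2m+2)!} \int_0^1 \big( P_{2m+2}(z) - B_{2m+2} \big) g^{(2m+2)}(z)\,dz
&= \frac{g^{(2m+2)}(\xi)}{(2m+2)!} \int_0^1 \big( P_{2m+2}(z) - B_{2m+2} \big)\,dz \\
&= \frac{g^{(2m+2)}(\xi)}{(2m+2)!} \left[ \frac{P_{2m+3}(z)}{2m + 3} - B_{2m+2}\, z \right]_0^1
 = -\frac{g^{(2m+2)}(\xi)\, B_{2m+2}}{(2m+2)!},
\end{aligned}
\tag{6.52}
\]

since $P_{2m+3}(1) = P_{2m+3}(0) = 0$.

In (6.52), we used the fact that $G(t) = P_{2n}(t) - P_{2n}(0) = P_{2n}(t) - B_{2n}$
does not change sign on $[0,1]$. To see this, suppose that $G$ has a zero in $(0, 1)$.
Then, since $G(0) = G(1) = 0$, Rolle's theorem implies that $G'$ has two zeros
in $(0,1)$. Since $P_{2n-1}$ is a multiple of $G'$, $P_{2n-1}$ must also have two zeros in
$(0,1)$. But $P_{2n-1}$ also vanishes at 0 and 1, so $P_{2n-1}'$ has three zeros in $(0,1)$.
Continuing in this manner, $P_k$, for odd $k < 2n$, must have at least two zeros
in $(0,1)$ as well as zeros at 0 and 1. This is impossible, since $P_3$ is a cubic
polynomial.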


Lemma 6.2 immediately gives us 


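THEOREM 6.4
For $f \in C^{2m+2}[a, b]$,

\[
\int_a^b f(x)\,dx = \frac{h}{2} \sum_{\nu=0}^{N-1} \big[ f(x_\nu) + f(x_{\nu+1}) \big]
- \sum_{j=1}^{m} \frac{B_{2j} h^{2j}}{(2j)!} \big( f^{(2j-1)}(b) - f^{(2j-1)}(a) \big)
- \frac{(b - a) B_{2m+2} h^{2m+2}}{(2m+2)!} f^{(2m+2)}(\xi)
\]

for some $\xi \in [a,b]$, where $x_\nu = a + \nu h$, $\nu = 0, 1, \dots, N$, and $h = (b - a)/N$.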


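PROOF Let $a = x_\nu$ and $b = x_{\nu+1}$ in Lemma 6.2. Then

\[
\int_{x_\nu}^{x_{\nu+1}} f(x)\,dx = \frac{h}{2} \big( f(x_\nu) + f(x_{\nu+1}) \big)
- \sum_{j=1}^{m} \frac{B_{2j} h^{2j}}{(2j)!} \big( f^{(2j-1)}(x_{\nu+1}) - f^{(2j-1)}(x_\nu) \big)
- \frac{B_{2m+2} h^{2m+3}}{(2m+2)!} f^{(2m+2)}(\xi_\nu),
\]

where $h = x_{\nu+1} - x_\nu = (b - a)/N$ and $x_\nu < \xi_\nu < x_{\nu+1}$. Now sum both sides
from $\nu = 0$ to $\nu = N - 1$. Then

\[
\begin{aligned}
\int_a^b f(x)\,dx &= \frac{h}{2} \sum_{\nu=0}^{N-1} \big[ f(x_\nu) + f(x_{\nu+1}) \big]
- \sum_{j=1}^{m} \frac{B_{2j} h^{2j}}{(2j)!} \big( f^{(2j-1)}(b) - f^{(2j-1)}(a) \big)
- \frac{B_{2m+2} (Nh) h^{2m+2}}{(2m+2)!} \sum_{\nu=0}^{N-1} \frac{f^{(2m+2)}(\xi_\nu)}{N} \\
&= \frac{h}{2} \sum_{\nu=0}^{N-1} \big[ f(x_\nu) + f(x_{\nu+1}) \big]
- \sum_{j=1}^{m} \frac{B_{2j} h^{2j}}{(2j)!} \big( f^{(2j-1)}(b) - f^{(2j-1)}(a) \big)
- \frac{B_{2m+2} (b - a) h^{2m+2}}{(2m+2)!} f^{(2m+2)}(\xi),
\end{aligned}
\]

since $Nh = b - a$ and

\[ \min_{a \le x \le b} f^{(2m+2)}(x) \le \sum_{\nu=0}^{N-1} \frac{f^{(2m+2)}(\xi_\nu)}{N} \le \max_{a \le x \le b} f^{(2m+2)}(x). \]

Finally, this and the Intermediate Value Theorem give

\[ \sum_{\nu=0}^{N-1} \frac{f^{(2m+2)}(\xi_\nu)}{N} = f^{(2m+2)}(\xi) \quad\text{for some } \xi \in [a, b]. \]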


Now, let us examine the scheme illustrated in Table 6.2 when we use the
trapezoidal rule as the base method. For this Romberg integration, define

\[ S_j^{(0)}(f) = h_j \left[ \frac{f(a)}{2} + \sum_{\nu=1}^{N_j - 1} f(a + \nu h_j) + \frac{f(b)}{2} \right], \]

where

\[ h_j = \frac{h}{2^j}, \quad N_j = 2^j N \quad\text{and}\quad h = \frac{b - a}{N}, \]

and, in general, define

\[ S_j^{(k+1)}(f) = \frac{4^{k+1} S_{j+1}^{(k)}(f) - S_j^{(k)}(f)}{4^{k+1} - 1} \quad\text{for } j = 0, 1, 2, \dots \text{ and } k = 0, 1, 2, \dots \]

Table 6.2 then becomes Table 6.4.

TABLE 6.4: Romberg integration and composite Newton-Cotes methods

  O(h^2)         O(h^4)         O(h^6)                    O(h^8)
  $S_0^{(0)}$
  $S_1^{(0)}$    $S_0^{(1)}$
  $S_2^{(0)}$    $S_1^{(1)}$    $S_0^{(2)}$
  $S_3^{(0)}$    $S_2^{(1)}$    $S_1^{(2)}$               $S_0^{(3)}$
  composite      composite      composite Newton-Cotes    not Newton-Cotes
  trapezoidal    Simpson's      of order O(h^6)           methods

In Table 6.4, the last row indicates that $S_j^{(1)}$
happens to be the composite Simpson's rule with $N_j$ subintervals, while $S_j^{(2)}$
happens to be the composite closed Newton-Cotes formula of order $O(h^6)$,
while $S_j^{(k)}$ for $k > 2$ does not correspond to any Newton-Cotes method.

The Romberg procedure is generally continued until $|S_0^{(k)} - S_0^{(k-1)}| < \epsilon$, for
a given tolerance $\epsilon$. Then, the best estimate of the integral is $S_0^{(k)}$.


REMARK 6.19 It can be proved that if $f \in C[a, b]$, the values in each
column of Table 6.4 converge to the integral [49].


REMARK 6.20 An efficient formula for the computation is the recursion
relation

\[ S_{j+1}^{(0)}(f) = \frac12 \big( S_j^{(0)}(f) + T_j(f) \big), \]

where $T_j(f) = h_j \sum_{\nu=0}^{N_j - 1} f(c_j + \nu h_j)$, with $c_j = a + h_j/2$. Use of this relation
requires only $N_j$ additional computations of the function $f(x)$ to calculate
$S_{j+1}^{(0)}(f)$ instead of $2N_j$.
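
The Romberg scheme, combined with the recursion of Remark 6.20, can be sketched in Python as follows. The function name `romberg`, the layout of the returned table, and the helper `sinc` (which supplies the limiting value 1 of $\sin x / x$ at $x = 0$) are illustrative assumptions, not part of the text.

```python
import math


def romberg(f, a, b, n_levels, N=1):
    """Return table with table[k][j] ~ S_j^{(k)} (trapezoidal base rule)."""
    h = (b - a) / N
    # S_0^{(0)}: composite trapezoidal rule with N subintervals
    s = h * (0.5 * f(a) + sum(f(a + v * h) for v in range(1, N)) + 0.5 * f(b))
    col0 = [s]
    n_j, h_j = N, h
    for _ in range(n_levels):
        # T_j: composite midpoint sum; uses only the N_j new points
        t = h_j * sum(f(a + (v + 0.5) * h_j) for v in range(n_j))
        s = 0.5 * (s + t)                      # S_{j+1}^{(0)} = (S_j^{(0)} + T_j)/2
        col0.append(s)
        n_j, h_j = 2 * n_j, h_j / 2
    table = [col0]
    for k in range(n_levels):
        prev = table[-1]
        factor = 4 ** (k + 1)                  # even-order extrapolation, as in (6.42)
        table.append([(factor * prev[j + 1] - prev[j]) / (factor - 1)
                      for j in range(len(prev) - 1)])
    return table


def sinc(x):
    return 1.0 if x == 0.0 else math.sin(x) / x


table = romberg(sinc, 0.0, 1.0, 3)
estimate = table[3][0]                         # S_0^{(3)}
```

Applied to the sine integral on $[0,1]$ with 1, 2, 4, and 8 subintervals, the first column ends with the 8-interval trapezoidal value and `estimate` is the fully extrapolated value $S_0^{(3)}$.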


Example 6.10
Numerical evaluation of the sine integral. Compute

\[ \int_0^1 \frac{\sin x}{x}\,dx. \]

Solution: With Romberg integration, we obtain the following, corresponding
to Table 6.4.

8 intervals   0.945691   $S_0^{(3)} = 0.946083$

Since $S_0^{(2)}$ and $S_0^{(3)}$ differ by one unit in the sixth digit, this is evidence that
$S_0^{(3)}$ is accurate to six digits.


We now consider the question: Can Richardson extrapolation be applied 
to other composite numerical integral rules besides the composite trapezoidal 
rule? It turns out that, indeed, it can be applied to basically any composite 
rule that integrates constant functions exactly. Before addressing this question 
in detail, consider the following example. 


Example 6.11
The composite midpoint rule applied to the sine integral. As in Example 6.10,
our task is to compute

\[ \int_0^1 \frac{\sin x}{x}\,dx, \]

but we now use the composite midpoint rule for $S$ instead of the trapezoidal
rule. We obtain

1 interval    $S_0^{(0)} = 0.958851$
2 intervals   0.9492337   $S_0^{(1)} = 0.946028$
4 intervals   0.9468682   0.946080   $S_0^{(2)} = 0.946083$


In [10], Baker and Hodgson show that almost any composite rule has the 
correct form for the error for performing Richardson extrapolation. They 
obtained the following result. 


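THEOREM 6.5
Let

\[ J(f) = \int_0^1 f(x)\,dx, \qquad Q(f) = \sum_{j=0}^{n} \alpha_j f(t_j), \qquad Q_c(f) = h \sum_{i=0}^{m-1} \sum_{j=0}^{n} \alpha_j f\big( (i + t_j) h \big). \]

(The above is the composite rule for $m$ intervals on $[0,1]$, i.e., $h = 1/m$.)

Suppose that $\sum_{j=0}^{n} \alpha_j = 1$ and $f \in C^N[0,1]$. Then,

\[ E_c(f) = J(f) - Q_c(f) = \sum_{k=1}^{N-1} c_k h^k + O(h^N), \]

where the $c_k$'s are independent of $h$. (Thus, the error has the correct form for
applying Richardson extrapolation.) Also,

(a) If $Q(f) = J(f)$ for polynomials of degree $\le r$, then $c_k = 0$ for $1 \le k \le r$.

(b) If $Q(f)$ is a symmetric quadrature rule (footnote 7), then $c_{2k+1} = 0$ for $k = 0, 1, 2, \dots$,
and thus odd powers of $h$ in the error expansion vanish.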


Example 6.12
Examples of symmetric and nonsymmetric quadrature rules


(1) The midpoint rule: $n = 0$, $\alpha_0 = 1$, $t_0 = 1/2$. This is a symmetric rule
with degree of precision 1. Thus, by Theorem 6.5, part (a), $c_1 = 0$, and
by part (b), $c_{2k+1} = 0$ for $k = 1, 2, \dots$. Hence,

\[ E_c(f) = c_2 h^2 + c_4 h^4 + c_6 h^6 + \cdots, \]

the same as for the trapezoidal rule.

(2) The rectangle rule: $n = 0$, $\alpha_0 = 1$, $t_0 = 1$, that is,

\[ Q(f) = f(1). \]

This is not a symmetric quadrature rule, and the degree of precision is
0. Hence,

\[ E_c(f) = c_1 h + c_2 h^2 + c_3 h^3 + \cdots \]

(3) 4-point Gauss quadrature: This is a symmetric rule. (See Table 6.1
on page 349 for the points and weights.) Also, this rule is exact for
polynomials of degree $\le 7$. Hence,

\[ E_c(f) = c_8 h^8 + c_{10} h^{10} + c_{12} h^{12} + \cdots \]


Footnote 7: A symmetric quadrature rule is a quadrature rule in which the points and weights are
symmetric on $[0,1]$, i.e., $\alpha_i = \alpha_{n-i}$, $t_i = 1 - t_{n-i}$ for $i = 0, 1, \dots, n$.
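
The leading orders claimed in Example 6.12 can be observed numerically. The sketch below assumes the composite form $Q_c(f) = h \sum_{i=0}^{m-1} \sum_j \alpha_j f((i + t_j)h)$ on $[0,1]$ with $h = 1/m$ (an assumption about the notation of Theorem 6.5), and estimates the order from the errors at $h$ and $h/2$; the function names and the test integrand $e^x$ are illustrative choices.

```python
import math


def composite(f, t_nodes, alphas, m):
    """Composite rule on [0, 1] with m subintervals of width h = 1/m."""
    h = 1.0 / m
    return h * sum(al * f((i + t) * h)
                   for i in range(m) for t, al in zip(t_nodes, alphas))


def observed_order(f, exact, t_nodes, alphas, m=16):
    """Estimate p in E_c ~ C h^p from the errors with m and 2m subintervals."""
    e1 = abs(exact - composite(f, t_nodes, alphas, m))
    e2 = abs(exact - composite(f, t_nodes, alphas, 2 * m))
    return math.log(e1 / e2, 2)


exact = math.e - 1.0                                                # integral of exp over [0, 1]
order_midpoint = observed_order(math.exp, exact, [0.5], [1.0])      # t_0 = 1/2
order_rectangle = observed_order(math.exp, exact, [1.0], [1.0])     # t_0 = 1
```

The observed orders are close to 2 for the midpoint rule and close to 1 for the rectangle rule, consistent with the leading terms $c_2 h^2$ and $c_1 h$.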


6.3.5 Multiple Integrals, Singular Integrals, and Infinite Intervals


We describe some special considerations in numerical integration in this 
section. 
6.3.5.1 Multiple Integrals 


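Consider

\[ \int_a^b \int_c^d f(x, y)\,dy\,dx \quad\text{or}\quad \iiint f(x, y, z)\,dz\,dy\,dx. \]

How can we approximate these integrals? One way is with a product formula,
in which we apply a one-dimensional quadrature rule in each variable.

Consider

\[
\begin{aligned}
\int_a^b \left( \int_c^d f(x, y)\,dy \right) dx
&= \int_a^b \left( \sum_{\ell=0}^{L-1} \int_{c_\ell}^{c_{\ell+1}} f(x, y)\,dy \right) dx \\
&= \int_a^b \sum_{\ell=0}^{L-1} \int_{-1}^{1} f\!\left( x, \frac{z h_y + 2c + (2\ell + 1) h_y}{2} \right) \frac{h_y}{2}\,dz\,dx \\
&\approx \int_a^b \sum_{\ell=0}^{L-1} \sum_{i=0}^{n} \alpha_i h_y f\!\left( x, \frac{y_i h_y + 2c + (2\ell + 1) h_y}{2} \right) dx,
\end{aligned}
\]

where

\[ h_y = \frac{d - c}{L}, \quad c_\ell = c + \ell h_y, \quad\text{and}\quad y_i = z_i \text{ for } i = 0, 1, \dots, n, \]

and where the one-dimensional quadrature rule is assumed to be of the form

\[ \int_{-1}^{1} g(z)\,dz \approx 2 \sum_{i=0}^{n} \alpha_i g(z_i). \]

(Note that we have applied a one-dimensional composite rule over the $y$ variable.)
The integral over the $x$-variable is treated similarly, giving

\[
\int_a^b \int_c^d f(x, y)\,dy\,dx \approx \sum_{\ell=0}^{L-1} \sum_{k=0}^{K-1} \sum_{i=0}^{n} \sum_{j=0}^{n} \alpha_i \alpha_j h_x h_y
f\!\left( \frac{x_j h_x + 2a + (2k + 1) h_x}{2},\; \frac{y_i h_y + 2c + (2\ell + 1) h_y}{2} \right),
\]

where $h_x = (b - a)/K$, $h_y = (d - c)/L$, and $x_j = z_j$ for $j = 0, \dots, n$.
The same procedure can be used for triple or higher multiple integrals.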
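
The product formula above can be sketched directly as nested sums. In the following Python sketch the one-dimensional rule is passed in as nodes `z` on $[-1,1]$ and weights `alpha` summing to 1; the function name `product_rule`, the choice of the 2-point Gauss rule, and the test integrands are illustrative assumptions.

```python
import math


def product_rule(f, a, b, c, d, K, L, z, alpha):
    """Composite product rule on [a, b] x [c, d] with K x L subrectangles."""
    hx, hy = (b - a) / K, (d - c) / L
    total = 0.0
    for k in range(K):
        for l in range(L):
            for zj, aj in zip(z, alpha):
                x = (zj * hx + 2 * a + (2 * k + 1) * hx) / 2
                for zi, ai in zip(z, alpha):
                    y = (zi * hy + 2 * c + (2 * l + 1) * hy) / 2
                    total += ai * aj * hx * hy * f(x, y)
    return total


g = 1.0 / math.sqrt(3.0)
nodes, weights = [-g, g], [0.5, 0.5]           # 2-point Gauss, weights summing to 1

approx = product_rule(lambda x, y: math.exp(x + y), 0, 1, 0, 1, 4, 4, nodes, weights)
exact = (math.e - 1.0) ** 2
poly = product_rule(lambda x, y: x ** 3 * y ** 2, 0, 2, 0, 1, 1, 1, nodes, weights)
```

Because the 2-point Gauss rule is exact for cubics in each variable, `poly` reproduces $\int_0^2 \int_0^1 x^3 y^2\,dy\,dx = 4/3$ up to rounding. The cost grows as the product of the numbers of points in each direction, which is the main limitation of product formulas in higher dimensions.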
6.3.5.2 Singularities and Infinite Intervals 


Consider $\int_a^b f(x)\,dx$. Suppose that $f$ is Riemann integrable but has a singularity
somewhere on $[a, b]$. (Alternately, for example, $f$ may be continuous
but $f'$ may have a singularity on $[a, b]$, which results in low accuracy of the numerical
quadrature methods used unless a large number of intervals is taken.)
Without loss of generality and to fix ideas, suppose that $f$ is Riemann integrable
on $[0, 1]$, continuous on $(0, 1]$ but not continuous on $[0, 1]$. For example,

\[ \int_0^1 f(x)\,dx = \int_0^1 \frac{\cos x}{\sqrt{x}}\,dx \quad\text{or}\quad \int_0^1 f(x)\,dx = \int_0^1 (\cos x - 1)^{-1/3}\,dx. \]


There is a variety of procedures that can be used to perform the integrations 
numerically or to increase the accuracy of the numerical quadrature method. 
In particular, we can 


(i) truncate the integral, 
(ii) change variables, 
(iii) subtract out the singularity, 
(iv) ignore the singularity, 
(v) use a singular integration rule. 
We now give some details for these five techniques. 


(i) truncate the integral — 


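Procedure: Evaluate $\int_\epsilon^1 f(x)\,dx$ numerically and estimate $\int_0^\epsilon f(x)\,dx$.
Example: Suppose that

\[ \int_0^1 f(x)\,dx = \int_0^1 \frac{g(x)}{\sqrt{x}}\,dx, \qquad g \in C[0,1]. \]

Then,

\[ \int_0^1 f(x)\,dx = \int_\epsilon^1 f(x)\,dx + \int_0^\epsilon \frac{g(x)}{\sqrt{x}}\,dx. \]

Hence,

\[ \left|\int_0^1 f(x)\,dx - \int_\epsilon^1 f(x)\,dx\right|
 = \left|\int_0^\epsilon \frac{g(x)}{\sqrt{x}}\,dx\right|
 \le \max_{0\le x\le\epsilon}|g(x)|\,2\sqrt{\epsilon}. \]

If $\max|g(x)|\,2\sqrt{\epsilon}$ is sufficiently small, then estimating $\int_0^1 f(x)\,dx$ by
$\int_\epsilon^1 f(x)\,dx$ is reasonable.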
(ii) Change variables — (may eliminate the singularity) 
Example: Suppose that $f(x) = x^{-1/n} g(x)$, $n \ge 2$, and $g \in C[0,1]$.
Let $x = t^n$, so that $dx = n t^{n-1}\,dt$.
Then $\int_0^1 f(x)\,dx = \int_0^1 x^{-1/n} g(x)\,dx = n\int_0^1 t^{n-2} g(t^n)\,dt$, which is a new
integral without the singularity at $t = 0$.
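
To see the effect of the change of variables, here is a minimal sketch for the assumed case $n = 2$, $g(x) = \cos x$, so that $\int_0^1 x^{-1/2}\cos x\,dx = 2\int_0^1 \cos(t^2)\,dt$. The helper name `midpoint` and the choice $N = 64$ are illustrative only; the reference value comes from integrating the cosine series term by term.

```python
import math


def midpoint(f, a, b, N):
    """Composite midpoint rule with N equal subintervals."""
    h = (b - a) / N
    return h * sum(f(a + (i + 0.5) * h) for i in range(N))


N = 64
# Original integrand: singular at x = 0.
direct = midpoint(lambda x: math.cos(x) / math.sqrt(x), 0.0, 1.0, N)
# After x = t^2, dx = 2 t dt, the integrand 2 cos(t^2) is smooth.
transformed = midpoint(lambda t: 2.0 * math.cos(t * t), 0.0, 1.0, N)
# Reference: 2 * sum_k (-1)^k / ((2k)! (4k + 1)).
reference = 2.0 * sum((-1) ** k / (math.factorial(2 * k) * (4 * k + 1))
                      for k in range(12))

print(abs(direct - reference), abs(transformed - reference))
```

With the same number of function evaluations, the transformed integral should be far more accurate, because the midpoint rule regains its $O(h^2)$ behavior once the singularity is removed.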


(iii) Subtract out the singularity — 
Example 1: 


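\[ \int_0^1 \frac{\cos x}{\sqrt{x}}\,dx
 = \int_0^1 \frac{1}{\sqrt{x}}\,dx + \int_0^1 \frac{\cos x - 1}{\sqrt{x}}\,dx
 = 2 + \int_0^1 \frac{\cos x - 1}{\sqrt{x}}\,dx, \]
where the last integral is not singular at $x = 0$. (Recall $\cos x = 1 - \frac{x^2}{2} + \cdots$.)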


Example 2: 


\[
\begin{aligned}
\int_0^1 \frac{x^{-1/2}}{1+x}\,dx
&= \int_0^1 \frac{x^{-1/2}(1+x) - x^{1/2}}{1+x}\,dx \\
&= \int_0^1 x^{-1/2}\,dx - \int_0^1 \frac{x^{1/2}}{1+x}\,dx
 = 2 - \int_0^1 \frac{x^{1/2}}{1+x}\,dx \\
&= 2 - \int_0^1 \frac{x^{1/2}(1+x) - x^{3/2}}{1+x}\,dx \\
&= 2 - \int_0^1 x^{1/2}\,dx + \int_0^1 \frac{x^{3/2}}{1+x}\,dx \\
&= 2 - \frac{2}{3} + \int_0^1 \frac{x^{3/2}}{1+x}\,dx \\
&= 2 - \frac{2}{3} + \int_0^1 \frac{x^{3/2}(1+x) - x^{5/2}}{1+x}\,dx \\
&= 2 - \frac{2}{3} + \frac{2}{5} - \int_0^1 \frac{x^{5/2}}{1+x}\,dx.
\end{aligned}
\]
(Notice that the integrand of the last integral is smooth, having two
continuous derivatives on $[0,1]$.) Applying the composite midpoint rule
with $N$ intervals to the above integrals yields the results in Table 6.5.


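TABLE 6.5: Subtracting out the singularity, Example 2. Each column applies the
composite midpoint rule to one of the successive forms of the integral above,
identified here by the integrand that is still integrated numerically.

    N    x^{-1/2}/(1+x)   x^{1/2}/(1+x)   x^{3/2}/(1+x)   x^{5/2}/(1+x)
    2    1.12991          1.55256         1.56891         1.58165
    4    1.26259          1.56374         1.57005         1.57345
    8    1.35466          1.56820         1.57057         1.57145
   16    1.41872          1.56986         1.57073         1.57096
   32    1.46355          1.57046         1.57077         1.57084
   64    1.49507          1.57068         1.57079         1.57081
  128    1.51729          1.57075         1.57080         1.57080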
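
The entries of Table 6.5 can be reproduced with a short script. This is a sketch, assuming the four columns correspond, in order, to the four forms of the integral derived in Example 2; the names `midpoint` and `table_row` are illustrative.

```python
import math


def midpoint(f, N):
    """Composite midpoint rule on [0, 1] with N equal subintervals."""
    h = 1.0 / N
    return h * sum(f((i + 0.5) * h) for i in range(N))


def table_row(N):
    """Approximations to the integral of x^(-1/2)/(1+x) over [0, 1]."""
    return [
        midpoint(lambda x: x ** -0.5 / (1.0 + x), N),
        2.0 - midpoint(lambda x: x ** 0.5 / (1.0 + x), N),
        2.0 - 2.0 / 3.0 + midpoint(lambda x: x ** 1.5 / (1.0 + x), N),
        2.0 - 2.0 / 3.0 + 2.0 / 5.0 - midpoint(lambda x: x ** 2.5 / (1.0 + x), N),
    ]


for N in (2, 4, 8, 16, 32, 64, 128):
    print(N, " ".join(f"{value:.5f}" for value in table_row(N)))
```

The exact value is $2\arctan 1 = \pi/2 \approx 1.57080$, so each subtraction of a singular term visibly speeds up convergence.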
(iv) Ignore the singularity — In this procedure, we use a rule, such as 


composite Gaussian quadrature, that does not involve evaluation of f(0). 


Example: 
eels 
| Sean 
0 1 + a2 


The method works slowly. Many intervals are needed to obtain an ac- 
curate result. 


(v) Develop singular integration rules — 
Example: Suppose that $k(x)$ is a weight function with singularity at
$x = 0$ and $k(x) > 0$ for $0 < x < 1$. Suppose also that $\int_0^1 k(x)\,x^j\,dx$
exists for $j = 0, 1, 2, \ldots, n$. We can then derive quadrature rules of
interpolatory type or Gaussian rules. Given a subdivision
$0 < x_0 < x_1 < \cdots < x_n \le 1$, we can find $\alpha_i$ such that

\[ \int_0^1 f(x)\,dx = \int_0^1 k(x) g(x)\,dx \approx \sum_{i=0}^{n} \alpha_i g(x_i) \]

is exact for $g(x) = 1, x, \ldots, x^n$. For example,
\[ \int_0^1 f(x)\,dx = \int_0^1 x^{-1/2} g(x)\,dx
 \approx \frac{14}{5}\,g(1/3) - \frac{8}{5}\,g(2/3) + \frac{4}{5}\,g(1) \]

is exact for $g(x) = 1$, $g(x) = x$, $g(x) = x^2$.
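
The weights in such an interpolatory rule follow from the exactness conditions, which form a small linear system in the moments of the weight function. A minimal sketch in exact rational arithmetic, assuming the weight function $k(x) = x^{-1/2}$ and the nodes $1/3$, $2/3$, $1$ (the helper `solve` is a hypothetical Gauss-Jordan routine, not a library call):

```python
from fractions import Fraction


def solve(A, b):
    """Solve A w = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


nodes = [Fraction(1, 3), Fraction(2, 3), Fraction(1)]
# Moments: int_0^1 x^(-1/2) x^j dx = 2 / (2j + 1), j = 0, 1, 2.
moments = [Fraction(2, 2 * j + 1) for j in range(3)]
# Exactness for g(x) = x^j gives sum_i w_i * x_i^j = moment_j.
A = [[x ** j for x in nodes] for j in range(3)]
weights = solve(A, moments)
print(weights)
```

The computed weights are $14/5$, $-8/5$, and $4/5$; note that one weight is negative, which is common for interpolatory rules on arbitrary nodes.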


Similarly, several procedures can be applied to numerically approximate
integrals over infinite intervals:
(i) Change variables.
(ii) Truncate the integral. 


(iii) Use a quadrature rule appropriate for infinite intervals. 


(i) Change variables — 
Example: 


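(a) Setting $x = e^{-y}$, $y = -\ln x$,

\[ \int_0^\infty f(y)\,dy = \int_0^1 \frac{f(-\ln x)}{x}\,dx = \int_0^1 \frac{g(x)}{x}\,dx, \]
with $g(x) = f(-\ln x)$.

(b) Setting $y = (x-a)/(b-x)$, $x = (a+by)/(1+y)$,
$dy = (b-a)/(b-x)^2\,dx$,

\[ \int_0^\infty f(y)\,dy = (b-a)\int_a^b \frac{f\!\left(\frac{x-a}{b-x}\right)}{(b-x)^2}\,dx. \]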


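(ii) Truncate the integral: Consider $\int_0^\infty f(x)\,dx \approx \int_0^R f(x)\,dx$, and
estimate $\int_R^\infty f(x)\,dx$ by some other means.
Example:

\[ \int_0^\infty \frac{e^{-x^2}}{1+x^2}\,dx \approx \int_0^{10} \frac{e^{-x^2}}{1+x^2}\,dx, \]
since
\[ \int_{10}^\infty \frac{e^{-x^2}}{1+x^2}\,dx
 \le \frac{1}{1+10^2}\int_{10}^\infty x e^{-x^2}\,dx
 = \frac{1}{1+10^2}\cdot\frac{1}{2}\,e^{-100}, \]
which is very small.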


(iii) Use a quadrature rule appropriate for an infinite interval — 
Consider a formula of Gauss type: 


\[ \int w(x) f(x)\,dx \approx \sum_{k=1}^{n} w_k f(x_k), \]

where the points and weights have been chosen so that the approximation
is exact for polynomials of degree $\le 2n-1$. It is assumed here that
$\int w(x)\,x^k\,dx < \infty$ for $k = 0, 1, \ldots, 2n-1$. Two such examples are:

(a) \[ \int_0^\infty e^{-x} f(x)\,dx \approx \sum_{k=1}^{n} w_k f(x_k), \]
where the points $x_k$ are the zeros of the Laguerre polynomial $L_n(x)$
and $w_k$ satisfy $w_k = (n!)^2 x_k / (L_{n+1}(x_k))^2$.

(b) \[ \int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx \approx \sum_{k=1}^{n} w_k f(x_k), \]
where the points $x_k$ are the zeros of the Hermite polynomial $H_n(x)$
and $w_k$ satisfy $w_k = 2^{n+1} n! \sqrt{\pi} / (H_{n+1}(x_k))^2$.


See [22] for more details on numerical integration of improper integrals. 
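
As a small check of the Gauss-Laguerre weight formula, the following sketch builds the two-point rule ($n = 2$). It assumes the normalization $L_n(x) = e^x \frac{d^n}{dx^n}(x^n e^{-x})$, for which $L_2(x) = x^2 - 4x + 2$ and $L_3(x) = -x^3 + 9x^2 - 18x + 6$; with a different normalization the weight formula changes by a constant factor. The function names are illustrative.

```python
import math


def L3(x):
    """Laguerre polynomial L_3 in the normalization e^x d^3/dx^3 (x^3 e^-x)."""
    return -x ** 3 + 9.0 * x ** 2 - 18.0 * x + 6.0


n = 2
# Zeros of L_2(x) = x^2 - 4x + 2.
nodes = [2.0 - math.sqrt(2.0), 2.0 + math.sqrt(2.0)]
# Weights w_k = (n!)^2 x_k / (L_{n+1}(x_k))^2.
weights = [math.factorial(n) ** 2 * x / L3(x) ** 2 for x in nodes]


def gauss_laguerre(f):
    """Approximate the integral of e^(-x) f(x) over [0, infinity)."""
    return sum(w * f(x) for w, x in zip(weights, nodes))


# The rule should be exact for polynomials of degree <= 2n - 1 = 3,
# for which the exact integral of e^(-x) x^k is k!.
for k in range(4):
    print(k, gauss_laguerre(lambda x, k=k: x ** k), math.factorial(k))
```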


6.3.6 Monte Carlo and Quasi-Monte Carlo Methods 


In multiple numerical integration, the number of computations is proportional
to $N^M$, where $N$ is the number of intervals used in the composite quadrature
rule and $M$ is the number of iterated integrals, assuming $N$ intervals are
used for each integral. Thus, if for example $M = 6$ and $N = 100$, the number
of computations is proportional to $10^{12}$. For high-dimensional problems, a
Monte Carlo approach can be useful. We consider here a one-dimensional 
problem, but the Monte Carlo approach is independent of dimension. 

Let 


\[ I(f) = \int_0^1 f(x)\,dx = \int_0^1 \mu(x) f(x)\,dx, \tag{6.53} \]

where $\mu(x)$ is the uniform probability density on $[0,1]$, that is,
$\mathrm{prob}(c < x < d) = \int_c^d \mu(x)\,dx = d - c$, assuming that $0 \le c < d \le 1$.
(We can always convert $\int_a^b f(z)\,dz$ to $\int_0^1 f(x)\,dx$.) The mean value of
$f(x)$ over the interval $[0,1]$ is thus $I(f)$. Let $x_1, x_2, x_3, \ldots, x_n$ be $n$
points randomly selected from a uniform distribution on $[0,1]$. (Generally, a
pseudorandom number generator is used.) We can then form the average


\[ f_n = \frac{1}{n}\sum_{i=1}^{n} f(x_i), \tag{6.54} \]

and we would expect $f_n \to I(f)$ as $n \to \infty$. We can estimate the error in
approximation (6.54) using the Central Limit Theorem.
First, the mean value of $f$ on $[0,1]$ is

\[ I(f) = \int_0^1 f(x)\mu(x)\,dx = \int_0^1 f(x)\,dx, \tag{6.55} \]
and the variance of $f$ on $[0,1]$ is
\[ \sigma^2 = \int_0^1 (f(x) - I(f))^2 \mu(x)\,dx = \int_0^1 f^2(x)\,dx - I^2(f). \tag{6.56} \]


The Central Limit Theorem tells us that 


\[ \mathrm{prob}\left(\left|\frac{1}{n}\sum_{i=1}^{n} f(x_i) - I(f)\right| \le \frac{\lambda\sigma}{\sqrt{n}}\right)
 \approx \frac{1}{\sqrt{2\pi}}\int_{-\lambda}^{\lambda} e^{-x^2/2}\,dx \tag{6.57} \]

for large $n$. (The Central Limit Theorem says, for large $n$, the sampling
distribution of means with mean $\mu$ and variance $\sigma^2$ is approximately a normal
distribution with mean $\mu$ and variance $\sigma^2/n$.) Thus, for example, if $\lambda = 1.96$,
then

\[ \mathrm{prob}\left(\left|\frac{1}{n}\sum_{i=1}^{n} f(x_i) - I(f)\right| \le \frac{1.96\,\sigma}{\sqrt{n}}\right) \approx 0.95. \]

That is, the error satisfies

\[ E(f) = |f_n - I(f)| \le \frac{1.96\,\sigma}{\sqrt{n}} \]

with probability 0.95. Hence, to reduce the error, we need to have large $n$ or
small $\sigma$. Also, notice that the error is statistical in nature and is proportional
to $1/\sqrt{n}$. In the following example, the Monte Carlo approximations generally
improve as $n$ increases. In addition, notice that only a single sum is required
in the approximation of the multiple integral. However, the error is still
proportional to $1/\sqrt{n}$, where $n$ is the number of points selected.


Example 6.13

\[ \int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 e^{x_1 x_2 x_3 x_4}\,dx_1\,dx_2\,dx_3\,dx_4 \approx 1.0693 \]

  $n$  |  $\frac{1}{n}\sum_{k=1}^{n} e^{x_{1k} x_{2k} x_{3k} x_{4k}}$
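
A Monte Carlo estimate of this four-dimensional integral, together with the 95% error estimate $1.96\sigma/\sqrt{n}$, can be sketched as follows. The sample size, the seed, and the function name `monte_carlo` are arbitrary choices made for illustration, and $\sigma$ is replaced by its sample estimate.

```python
import math
import random


def monte_carlo(f, dim, n, rng):
    """Return the sample mean of f over [0,1]^dim and a 95% half-width."""
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        value = f([rng.random() for _ in range(dim)])
        total += value
        total_sq += value * value
    mean = total / n
    variance = total_sq / n - mean * mean       # sample estimate of sigma^2
    half_width = 1.96 * math.sqrt(variance / n)  # 1.96 * sigma / sqrt(n)
    return mean, half_width


rng = random.Random(12345)  # fixed seed so that the run is repeatable
estimate, half_width = monte_carlo(
    lambda x: math.exp(x[0] * x[1] * x[2] * x[3]), 4, 20000, rng)
print(estimate, "+/-", half_width)
```

Only a single sum over the sample points is needed, regardless of the dimension; quadrupling $n$ should roughly halve the half-width.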


There is also a variety of ways to reduce the variance $\sigma^2$. One method is
called importance sampling. Consider

\[ I(f) = \int_0^1 \frac{f(x)}{p(x)}\,p(x)\,dx, \quad\text{where } p(x) > 0 \text{ and } \int_0^1 p(x)\,dx = 1. \]

Instead of the approximation (6.54), use

\[ f_n = \frac{1}{n}\sum_{i=1}^{n} \frac{f(x_i)}{p(x_i)}, \]

where the $x_i$ are random variables extracted from probability density $p(x)$ on
$[0,1]$. When we do so, the variance of $f/p$ with respect to the density $p$ is

\[ \sigma^2 = \int_0^1 \frac{f^2(x)}{p^2(x)}\,p(x)\,dx - \left(\int_0^1 \frac{f(x)}{p(x)}\,p(x)\,dx\right)^2, \]
and if $p(x) \to \dfrac{f(x)}{\int_0^1 f(x)\,dx}$, then $\sigma^2 \to 0$.


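Example 6.14
Let $f(x) = e^x$. Replace $\int_0^1 e^x\,dx$ by

\[ \int_0^1 \frac{e^x}{p(x)}\,p(x)\,dx, \qquad p(x) = \tfrac{2}{3}(1+x). \]

The original $\sigma^2$ is

\[ \int_0^1 e^{2x}\,dx - \left(\int_0^1 e^x\,dx\right)^2 = 2e - \frac{3}{2} - \frac{e^2}{2} \approx 0.24204. \]

The new $\sigma^2$ is

\[ \frac{3}{2}\int_0^1 \frac{e^{2x}}{1+x}\,dx - \left(\int_0^1 e^x\,dx\right)^2 \approx 0.0269. \]

The new $\sigma^2$ is about a factor of 10 smaller than the original $\sigma^2$.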
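
The variance reduction in Example 6.14 can be observed numerically. The sketch below assumes the density $p(x) = \frac{2}{3}(1+x)$ and samples from it by inverting its distribution function $F(x) = (x^2 + 2x)/3$, which gives $x = -1 + \sqrt{1 + 3u}$ for uniform $u$. The sample size, seed, and helper names are illustrative.

```python
import math
import random


def sample_p(rng):
    """Draw x from p(x) = (2/3)(1 + x) on [0, 1] by inverse transform."""
    return -1.0 + math.sqrt(1.0 + 3.0 * rng.random())


def mean_and_variance(values):
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


rng = random.Random(2024)  # fixed seed so that the run is repeatable
n = 50000

# Plain Monte Carlo: average f(x) = e^x at uniform points.
plain = [math.exp(rng.random()) for _ in range(n)]

# Importance sampling: average f(x)/p(x) at points drawn from p.
weighted = []
for _ in range(n):
    x = sample_p(rng)
    weighted.append(math.exp(x) / (2.0 / 3.0 * (1.0 + x)))

plain_mean, plain_var = mean_and_variance(plain)
is_mean, is_var = mean_and_variance(weighted)
print(plain_mean, plain_var)   # mean near e - 1, variance near 0.242
print(is_mean, is_var)         # mean near e - 1, variance near 0.027
```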


For more information on Monte Carlo methods in numerical integration see, 
for example, [1], [22], [28], or [55]. 


Quasi-Monte Carlo Methods There has been much recent interest in a 
new class of methods, quasi-Monte Carlo methods, which are actually deter- 
ministic rather than statistical [28, 66]. Like Monte Carlo methods, quasi- 
Monte Carlo methods are easily applied to multiple integrals, but generally 
have the advantage of higher accuracy than Monte Carlo methods. Consider 
a one-dimensional problem and let 


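\[ |E_N(f)| = \left|\frac{1}{N}\sum_{n=1}^{N} f(x_n) - \int_0^1 f(x)\,dx\right|. \]

If
\[ |E_N(f)| \le c\left(\int_0^1 |f'(x)|\,dx\right)\frac{(\log N)^k}{N}, \]

where $c$ and $k$ are constants, then the sequence of points $\{x_n\}_{n=1}^{\infty}$ is said to
be a quasi-random sequence or a sequence of low discrepancy points.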


Example 6.15 
Consider the Halton sequence of low discrepancy points, as follows. Let $r$ be
a prime, and let $n = 1, 2, \ldots$ be written in base $r$ as $n = \sum_{j=0}^{m} a_j(n)\,r^j$. Then,
$\{\psi(n)\}_{n=1}^{\infty}$ is the Halton sequence, where
\[ \psi(n) = \sum_{j=0}^{m} a_j(n)\,r^{-j-1}. \]

(Notice that the numbers are reflected about the decimal point.) For example,
$r = 3$ in the following table,

  $n$   $n$ in base 3                      $\psi(n)$
  1     $1 \times 3^0$                     $1 \times 3^{-1} = 1/3$
  2     $2 \times 3^0$                     $2 \times 3^{-1} = 2/3$
  3     $0 \times 3^0 + 1 \times 3^1$      $0 \times 3^{-1} + 1 \times 3^{-2} = 1/9$
  4     $1 \times 3^0 + 1 \times 3^1$      $1 \times 3^{-1} + 1 \times 3^{-2} = 4/9$
  5     $2 \times 3^0 + 1 \times 3^1$      $2 \times 3^{-1} + 1 \times 3^{-2} = 7/9$


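For a multidimensional ($d$ dimensions) low discrepancy Halton sequence,
let $p_1, p_2, \ldots, p_d$ be the first $d$ prime numbers. Let

\[ n = \sum_{i=0}^{m} a_i\,p_j^{\,i}, \quad\text{where } a_i \in [0, p_j - 1] \text{ for } i = 0, 1, \ldots, m. \]

Let

\[ \psi_{p_j}(n) = \sum_{i=0}^{m} a_i\,p_j^{-i-1}. \]

Then, $z_n = (\psi_{p_1}(n), \psi_{p_2}(n), \ldots, \psi_{p_d}(n))$ is the $n$-th $d$-dimensional point in
the sequence.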
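
The digit-reversal construction translates directly into code. A minimal sketch in exact rational arithmetic (the names `radical_inverse` and `halton_point` are illustrative, not from the text):

```python
from fractions import Fraction


def radical_inverse(n, r):
    """Reflect the base-r digits of n about the radix point."""
    result = Fraction(0)
    scale = Fraction(1, r)
    while n > 0:
        n, digit = divmod(n, r)   # peel off the lowest base-r digit
        result += digit * scale   # digit a_j contributes a_j * r^(-j-1)
        scale /= r
    return result


def halton_point(n, primes):
    """n-th point of the multidimensional sequence, one prime per dimension."""
    return tuple(radical_inverse(n, p) for p in primes)


print([radical_inverse(n, 3) for n in range(1, 6)])
print(halton_point(1, (2, 3)))
```

For base 3 the first five values are $1/3$, $2/3$, $1/9$, $4/9$, $7/9$, matching the table in Example 6.15.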


To motivate quasi-Monte Carlo methods that use low discrepancy sequences, 
consider numerical integration in d dimensions. First, consider d = 1 and a 
standard numerical integration rule such as the composite trapezoidal rule. 


We have
\[ \int_0^1 f(u)\,du \approx \sum_{n=0}^{m} w_n f\!\left(\frac{n}{m}\right), \tag{6.58} \]

where $w_0 = w_m = 1/(2m)$ and $w_n = 1/m$ for $1 \le n \le m-1$. The error in this
rule is proportional to $1/m^2$, assuming that $f \in C^2[0,1]$. Now, for general
dimension $d$, we have

\[ \int_{[0,1]^d} f(u)\,du \approx \sum_{n_1=0}^{m}\sum_{n_2=0}^{m}\cdots\sum_{n_d=0}^{m}
 w_{n_1} w_{n_2} \cdots w_{n_d}\, f\!\left(\frac{n_1}{m}, \frac{n_2}{m}, \ldots, \frac{n_d}{m}\right). \tag{6.59} \]

The total number of nodes in (6.59) is $N = (m+1)^d$. The error in (6.59)
is $O(1/m^2)$. In terms of the number of nodes, the error is $O(N^{-2/d})$. Thus,
with increasing $d$, the approximation error increases rapidly. To guarantee
a prescribed level of accuracy, say to guarantee an absolute error less than
$10^{-4}$, we must use roughly $10^{2d}$ nodes. Hence, the number of nodes increases
exponentially with $d$, to maintain a given level of accuracy.

The Monte Carlo method helps to overcome this problem of dimensionality.
The error in the Monte Carlo approximation is less than $1.96\sigma/\sqrt{N}$ with
probability 0.95. Hence, the Monte Carlo method is said to have error
proportional to $1/\sqrt{N}$. Notice that this error is independent of the number of
dimensions and, for a large number of dimensions, Monte Carlo methods are
more attractive than classical integration rules. By using low discrepancy
sequences, an error bound of $O\left((\log N)^d N^{-1}\right)$ is possible. Thus, quasi-Monte
Carlo methods can be more accurate than Monte Carlo methods for large $d$.


Example 6.16
Consider $\int_0^1 5x^4\,dx = 1$.

  N (number     Monte Carlo   Halton
  of points)    Estimate      Estimate
   500          1.18665       0.98661
  1000          1.13313       0.99242
  1500          1.09591       0.99375
  2000          1.07040       0.99623
  2500          1.03708       0.99564
  3000          1.04790       0.99713
  3500          1.05123       0.99741
  4000          1.03659       0.99709
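
A comparison of this kind can be sketched as follows. The base of the Halton sequence and the random number generator behind the table are not stated in the text, so this sketch assumes base 2 and Python's default generator with an arbitrary seed; its numbers will therefore not match the table digit for digit.

```python
import random


def radical_inverse(n, r):
    """Reflect the base-r digits of n about the radix point (floating point)."""
    result = 0.0
    scale = 1.0 / r
    while n > 0:
        n, digit = divmod(n, r)
        result += digit * scale
        scale /= r
    return result


def f(x):
    return 5.0 * x ** 4


def monte_carlo_estimate(N, rng):
    return sum(f(rng.random()) for _ in range(N)) / N


def halton_estimate(N, r=2):
    return sum(f(radical_inverse(n, r)) for n in range(1, N + 1)) / N


rng = random.Random(7)  # arbitrary seed
for N in (500, 1000, 2000, 4000):
    print(N, monte_carlo_estimate(N, rng), halton_estimate(N))
```

The quasi-random estimates typically settle near 1 noticeably faster than the pseudorandom ones, in line with the $(\log N)^k/N$ versus $1/\sqrt{N}$ error behavior.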


6.3.7 Adaptive Quadrature 


If the function varies more rapidly in one part of the interval of integration 
than in other parts, and it is not known beforehand where the rapid variation 
is, then a single rule or a composite rule in which the subintervals all have 
the same length is not the most efficient. Also, with general routines within
larger numerical software libraries or packages, a user typically supplies a
function $f$, an interval of integration $[a,b]$, and an error tolerance $\epsilon$, without
supplying any additional information about the function’s smoothness.$^8$ In


8 A function is “smooth” if it has many continuous derivatives. Generally the “degree of
smoothness” refers to the number of continuous derivatives available. Even if a function has,
in theory, many continuous derivatives, we might consider it not to be smooth numerically
if it changes curvature rapidly at certain points. An example of this is the function
$f(x) = \sqrt{x^2 + \epsilon}$: as $\epsilon$ gets small, the graph of this function becomes
indistinguishable from that of $f(x) = |x|$.

such cases, the quadrature routine itself should detect which portions of the
interval of integration (or domain of integration in the multidimensional case)
need to have a small interval length, and which portions need to have a larger
interval length, to achieve the specified tolerance $\epsilon$. In such instances, adaptive
quadrature is appropriate.

Adaptive quadrature can be considered to be a type of branch and bound
method.$^9$ In particular, the following general procedure can be used to
compute $\int_a^b f(x)\,dx$.

1. (Initialization)

   (a) Input an absolute error tolerance $\epsilon$, and a minimum interval length
       $\delta$.

   (b) Input the interval of integration $[a,b]$ and the function $f$.

   (c) sum ← 0.

   (d) $\mathcal{L}$ ← $\{[a,b]\}$, where $\mathcal{L}$ is a list of subintervals that needs to be
       considered.

2. DO WHILE $\mathcal{L} \ne \emptyset$.

   (a) Remove the first interval from $\mathcal{L}$ and place it in the current interval
       $[c,d]$.

   (b) Apply a quadrature formula over the current interval $[c,d]$ to obtain
       an approximation $I_c$.

   (c) (bound): Use an error formula for the rule to obtain a bound $E_c$
       for the error, or else obtain $E_c$ as a heuristic estimate for the error.
       This can be done by either using an error formula or by comparing
       with a different quadrature rule of the same or different order.

   (d) IF $E_c < \epsilon(d-c)$, THEN
          sum ← sum + $I_c$.
       ELSE
          IF $(d-c) < \delta$ THEN
             RETURN with a message that the tolerance $\epsilon$ could not
             be met with the given minimum step size $\delta$.
          ELSE
             (branch): form two new intervals $[c,(c+d)/2]$ and
             $[(c+d)/2,d]$, and store each into the list $\mathcal{L}$.
          END IF
       END IF

   END DO

9 We explain another type of branch and bound method, of common use in optimization, in
§9.6.3 on page 523.


A good example implementation of an adaptive quadrature routine is given
in the classic text [30] of Forsythe, Malcolm, and Moler.$^{10}$ This routine,
quanc8, is based on an 8-panel Newton-Cotes quadrature formula and a
heuristic estimate for the error. The heuristic estimate is obtained by
comparing the approximation with the 8-panel rule over the entire subinterval
$[c,d]$ and the approximation with the composite rule obtained by applying the
8-panel rule over the two halves of $[c,d]$; see [30, pp. 94-105] for details. The
routine itself$^{11}$ can be found in NETLIB, presently at
http://www.netlib.org/fmm/quanc8.f.
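
The general branch-and-bound procedure can be sketched in Python as follows. This is not quanc8: as a simplifying assumption it uses Simpson's rule as the basic formula, takes the heuristic error estimate $|S_{\text{two halves}} - S_{\text{whole}}|/15$, and accumulates the two-halves value when an interval is accepted. The names `simpson` and `adaptive_quadrature` are illustrative.

```python
import math
from collections import deque


def simpson(f, c, d):
    """Simpson's rule on the single interval [c, d]."""
    return (d - c) / 6.0 * (f(c) + 4.0 * f(0.5 * (c + d)) + f(d))


def adaptive_quadrature(f, a, b, eps, delta):
    """Adaptive quadrature with tolerance eps and minimum interval length delta."""
    total = 0.0
    work = deque([(a, b)])                 # the list of subintervals
    while work:
        c, d = work.popleft()              # current interval [c, d]
        mid = 0.5 * (c + d)
        whole = simpson(f, c, d)
        halves = simpson(f, c, mid) + simpson(f, mid, d)
        err = abs(halves - whole) / 15.0   # heuristic estimate (bound step)
        if err <= eps * (d - c):
            total += halves
        elif d - c < delta:
            raise RuntimeError("tolerance could not be met with the given "
                               "minimum interval length")
        else:                              # branch step
            work.append((c, mid))
            work.append((mid, d))
    return total


# sqrt has an unbounded derivative at 0, so intervals cluster there.
print(adaptive_quadrature(math.sqrt, 0.0, 1.0, 1e-6, 1e-12))   # about 2/3
print(adaptive_quadrature(math.sin, 0.0, math.pi, 1e-8, 1e-10))  # about 2
```

Because the acceptance test scales the tolerance by the interval length, the accepted error estimates sum to at most $\epsilon(b-a)$; the estimate itself is only heuristic, so this is not a rigorous bound.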


6.3.8 Interval Bounds 


Mathematically rigorous bounds on integrals can be computed, for simple
integration, for composite rules, and for adaptive quadrature, if interval
arithmetic is used in the error formulas. As an example, take the two point
Gauss-Legendre quadrature rule:

\[ \int_{-1}^{1} f(x)\,dx = f\!\left(-\frac{1}{\sqrt{3}}\right) + f\!\left(\frac{1}{\sqrt{3}}\right) + \frac{1}{135} f^{(4)}(\xi), \tag{6.60} \]

for some $\xi \in [-1,1]$, where the quadrature formula is obtained from Table 6.1
(on page 349) and where the error term is obtained from Formula (6.38) with
$a = -1$, $b = 1$, and

\[ \int_{-1}^{1} \left(x^2 - \frac{1}{3}\right)^2 dx = \frac{8}{45}. \]


Now, suppose we want to find guaranteed error bounds on the integral

\[ \int_{-1}^{1} e^{0.1x}\,dx. \]

Then, the fourth derivative of $e^{0.1x}$ is $(0.1)^4 e^{0.1x}$, and an interval evaluation of
this over $x = [-1,1]$ gives

\[ (0.1)^4 e^{0.1x} \in [0.9048, 1.1052] \times 10^{-4} \quad\text{for } x \in [-1,1], \]
where the interval enclosure for the range $e^{0.1[-1,1]}$ was obtained using the
MATLAB toolbox INTLAB [77]. The application of the quadrature rule thus

10This text doubles as an elementary numerical analysis text and as a “user guide” for the 
routines it explains. It distinguished itself from other texts of the time by featuring routines 
that were simple enough to be used to explain the elementary concepts, yet sophisticated 
enough to be used to solve practical problems. 

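11 In Fortran 66, but written carefully and clearly.
gives

\[
\begin{aligned}
\int_{-1}^{1} e^{0.1x}\,dx
&\in e^{-0.1/\sqrt{3}} + e^{0.1/\sqrt{3}} + \frac{1}{135}[0.9048, 1.1052] \times 10^{-4} \\
&\subseteq e^{-0.1/[1.7320,\,1.7321]} + e^{0.1/[1.7320,\,1.7321]} + \frac{1}{135}[0.9048, 1.1052] \times 10^{-4} \\
&\subseteq [2.0033, 2.0034],
\end{aligned}
\]
where the factor $1/135$ comes from the error term in (6.60) and the computations
are carried out in interval arithmetic, as in INTLAB. This set of computations
provides a mathematical proof that the exact integral, $(e^{0.1} - e^{-0.1})/0.1 \approx 2.003335$,
lies within $[2.0033, 2.0034]$.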
The higher order derivatives required in the quadrature formulas can be
bounded over intervals using a combination of automatic differentiation
(explained in §6.2, starting on page 327, of this book) and interval arithmetic.
The mathematically rigorous error bounds obtained by this technique can
be used in an adaptive quadrature technique, and the resulting routine can
give mathematically rigorous bounds, provided $I_c$ and sum are computed with
interval arithmetic and the error bounds are added to each $I_c$ when it is added
to sum. Such a routine is described in [21], although an updated package was
not widely available at the time this book was written.
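
The structure of the enclosure for $\int_{-1}^{1} e^{0.1x}\,dx$ can be imitated in ordinary floating point. The sketch below is not rigorous, since it uses no directed rounding (a true interval package such as INTLAB would round lower bounds down and upper bounds up); it only shows how the error term of (6.60), including its factor $1/135$, turns bounds on $f^{(4)}$ into bounds on the integral. The function name `enclosure` is illustrative.

```python
import math


def enclosure():
    """Approximate lower and upper bounds for the integral of e^(0.1 x) on [-1, 1]."""
    node = 1.0 / math.sqrt(3.0)
    rule = math.exp(-0.1 * node) + math.exp(0.1 * node)   # two-point Gauss rule
    # The fourth derivative (0.1)^4 e^(0.1 x) is increasing on [-1, 1].
    d4_low = 0.1 ** 4 * math.exp(-0.1)
    d4_high = 0.1 ** 4 * math.exp(0.1)
    return rule + d4_low / 135.0, rule + d4_high / 135.0


lower, upper = enclosure()
exact = (math.exp(0.1) - math.exp(-0.1)) / 0.1
print(lower, exact, upper)
```

The exact value lies between the two computed bounds, which differ only in about the seventh decimal place.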


6.4 Exercises 


1. Carry out the details of the computation to derive (6.60). 


2. Assume that we have a finite-difference approximation method where
the roundoff error is $O(\epsilon/h)$ and the truncation error is $O(h^n)$. Using
the error bounding technique exemplified in (6.10) on page 326, show
that the optimal $h$ is $O(\epsilon^{1/(n+1)})$ and the minimum achievable error
bound is $O(\epsilon^{n/(n+1)})$.


3. Consider the finite difference formula (6.8). 
(a) Derive an error bound (bounding both roundoff and truncation) 
for this formula analogous to (6.10). 


(b) Compute an optimal h and a minimal achievable error, as was done 
following (6.10). 


(c) Produce a table similar to that in Example 6.2, but using this 
formula. 


(d) Compare the results in your table to the optimal h and minimal 
achievable error you have computed. 


4. Repeat Exercise 3, but for formula (6.9) instead of formula (6.8). 


5. Assume that f ∈ C[a,b] and $x_0$, $x_0 + h$, $x_0 + 2h \in [a,b]$. Prove that
there exist constants $c_1$ and $c_2$ such that


f"(00) ~ 5 | $y leo) + eran + A) + enf20 + 20) 
Sen max, |f"(2)|, 


where c > 0 is a constant independent of h. 


6. Let $f \in C^\infty(-\infty, \infty)$ and let $x_0 \in \mathbb{R}$ be given.

(a) Prove that
\[ C_h = \frac{f(x_0 + h) - f(x_0 - h)}{2h} = f'(x_0) + \sum_{j=1}^{\infty} c_j h^{2j}, \]
where $c_j$, $j = 1, 2, 3, \ldots$ are independent of $h$.

(b) Suppose that $C_h$ and $C_{h/2}$ have been calculated. Find constants $\alpha_1$
and $\alpha_2$ so that

\[ \alpha_1 C_h + \alpha_2 C_{h/2} = f'(x_0) + O(h^4). \]


7. Write down a general formula for the (k + 1)-st component of sin(uv),
as in Formula (6.14) on page 329. Note: It is permissible to look this
up, in this case.

8. Write down a general formula for the (k + 1)-st component of (uv)^n,
following the forward differentiation scheme of §6.2.1. Note: as with
Exercise 7, it is permissible to look this up, in this case.

9. Complete the computations for Example 6.4.


10. Solve the system (6.19) (on page 333) and compare your result to the
corresponding directional derivative of $f$ computed by taking the gradient
of $f$ and taking the dot product with the direction.


11. Suppose $Q(f)$ is any integration rule of the form (6.21) (page 334) that
integrates constants exactly, that is, such that


[re—an) =a) 


Further, assume that f € C[a,b], and apply the composite quadrature 
rule (6.23) (on page 335) based on Q. Prove that 


N-1 


b 
Bf) =f Fle)de— Y QL1f) +0 as N00. 


v=0 
12. Let $Q_m(f)$ denote the $m$-point Gaussian quadrature rule over the interval
$[a,b]$ and with continuous weight function $\rho$, that is,

\[ Q_m(f) = \sum_{i=1}^{m} \alpha_i f(x_i) \approx J(f) = \int_a^b \rho(x) f(x)\,dx. \]

Show that, if $a$ and $b$ are finite and $f$ is continuous, then $Q_m(f) \to J(f)$
as $m \to \infty$.

(Hint: You may wish to consider the Weierstrass approximation theorem,
page 205, as well as Theorem 6.8 on page 352.)


13. Assume that $f \in C^2[a,b]$. Let $M = \max_{x \in [a,b]} |f''(x)|$.


a+b M 


b 
(a) Prove that i f(a)dx — (b— a) f( 5 P< (6- a? 


P Ra Pd ide 8 
= ; iat 
J foar- p(BABAL)) < (6 a) pA 
a j=0 
14. Let $f \in C[0,1]$ and let $\sum_{i=1}^{n} w_i f(x_i)$ be an approximation of $\int_0^1 f(x)\,dx$.


Assume that 0 < w; . ae Te ce gery er 7 
for any n. Let E,,( ae eed x)dx — >>;_, wi f(z) be the error in the 
approximation. ae that oe (P,) = 0 for any P,,, a polynomial of 
degree < n. Prove that given « > 0 there exists an N > 0 such that 
|En(f)| <¢ for alln > N. 


15. Consider the integral approximation: $\int_{-1}^{1} f(x)\,dx \approx f(-\alpha) + f(\alpha)$.


(a) Prove that the error in this approximation is bounded by 


1 
(a? — a? +5) max, |f"(2)|. 


(b) What is the optimal a in the sense that the integration rule is exact 
for the highest degree polynomial possible? 


16. Answer the following:


(a) Consider the formula 


i Piet {Ar(o) + Bis) + cr} 


Find A, B, and C such that this is exact for all polynomials of 
degree less than or equal to 2. 
(b) Suppose that the Trapezoidal rule applied to het x)dx a the
value 4 5 while the quadrature rule in part (a) applied S bot x)dx 
gives the value 5. If f(0) = 3, then show that f(3) = 


1 N- 
1 
17. Let | f(x) da = » f(x:)h where x; = ih and h = —. Suppose that 


N-1 


N 
cl / 
ie f(a) dx — df xi) Sy max |f’(x)|. 


O<a@<1 


f € C'[0,1]. Prove that 


18. Consider the quadrature formula of the type

\[ \int_0^1 f(x)\,[x \ln(1/x)]\,dx = a_0 f(0) + a_1 f(1). \]

(a) Find $a_0$ and $a_1$ such that the formula is exact for linear polynomials.
(b) Describe how the above formula, for $h > 0$, can be used to approximate
$\int_0^h g(t)\,t \ln(h/t)\,dt$.
19. Suppose that $I(h)$ is an approximation to $\int_a^b f(x)\,dx$, where $h$ is the
width of a uniform subdivision of $[a,b]$. Suppose that the error satisfies

\[ I(h) - \int_a^b f(x)\,dx = c_1 h + c_2 h^2 + O(h^3), \]

where $c_1$ and $c_2$ are constants independent of $h$. Let $I(h)$, $I(h/2)$, and
$I(h/3)$ be calculated for a given value of $h$. Use the values $I(h)$, $I(h/2)$
and $I(h/3)$ to find an $O(h^3)$ approximation to $\int_a^b f(x)\,dx$.


20. Use a two point Gauss quadrature rule to approximate the following 


integral: 
A 1 o-2/2 ¢ 
— — ee XL 
~c V2T 


21. Find the nodes x; and the corresponding weights A;, i = 0,1,2, so the 


formula 
2 


I 1 
—— f(z) da = Aif (xi) 


1=0 


is exact when f(x) is any polynomial of degree 5. Compare your solution 
with the roots of the Chebyshev polynomial of the first kind T3, given 
by T3(x) = cos(3.cos~1(a)). 




22. Let $\Phi(x)$ be the piecewise linear interpolant of $f(x)$ on the partition
$a = x_0 < x_1 < \cdots < x_n = b$, where $x_j = a + jh$ for $j = 0, 1, \ldots, n$
and $h = \frac{b-a}{n}$. Show that $\int_a^b \Phi(x)\,dx$ is equivalent to the composite
trapezoidal rule approximation to $\int_a^b f(x)\,dx$.


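23. Let $f \in C^1([0,1] \times [0,1])$ and $h = \frac{1}{n}$. Show that

\[ \left|\int_0^1\!\int_0^1 f(x,y)\,dx\,dy - h^2 \sum_{i=0}^{n-1}\sum_{j=0}^{n-1} f(ih, jh)\right| \le ch \]

for some constant $c$ independent of $h$. (Recall

\[ f(x,y) = f(a,b) + \frac{\partial f}{\partial x}(\mu,\xi)(x - a) + \frac{\partial f}{\partial y}(\mu,\xi)(y - b) \]

for some $(\mu,\xi)$ on the line between $(x,y)$ and $(a,b)$.)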


24. Suppose that a particular composite quadrature rule is used to approximate
$\int_0^2 e^{x^2}\,dx$. The following values are obtained for $N = 8$, 16, and
32 intervals, respectively: 16.50606, 16.45436, and 16.45347. Using only
these values, estimate the power of $h$ to which the error is proportional,
where $h = \frac{2}{N}$.


25. A two dimensional Gaussian quadrature formula has the form

\[ \int_{-1}^{1}\!\int_{-1}^{1} f(x,y)\,dx\,dy \approx f(\alpha,\alpha) + f(-\alpha,\alpha) + f(\alpha,-\alpha) + f(-\alpha,-\alpha). \]

Find the value of $\alpha$ such that the formula is exact (i.e., $E(f) = 0$) for
every polynomial $f(x,y)$ of degree less than or equal to 2 in 2 variables,
i.e., $f(x,y) = \sum_{i,j=0}^{2} a_{ij} x^i y^j$.


26. Consider approximation of $I(f) = \int_0^1 e^x\,dx$ using the Monte Carlo method.


Find the number of points n so that 


1X : 
b >) vi “ad 
pro (| yee x 


< ooo = 0.997. 


Chapter 7 


Initial Value Problems for Ordinary 
Differential Equations 


7.1 Introduction 


In this chapter, we are interested in approximating the solution of the
following initial-value problem (IVP) for a first-order system of differential
equations. We seek to approximate $y : [a,b] \to \mathbb{R}^n$ that satisfies

\[
\begin{aligned}
y'(t) &= f(t, y(t)), \quad a \le t \le b, \\
y(a) &= y_0,
\end{aligned}
\tag{7.1}
\]

where $f$ is a given $\mathbb{R}^n$-valued function of $n+1$ real variables and $y_0$ is a given
vector in $\mathbb{R}^n$. That is, we seek $n$ functions $y_i(t)$, $1 \le i \le n$, defined for
$a \le t \le b$ such that

\[
\begin{aligned}
\frac{dy_i}{dt}(t) &= f_i(t, y_1(t), y_2(t), \ldots, y_n(t)), \quad 1 \le i \le n, \quad a \le t \le b, \\
y_i(a) &= y_{0,i}.
\end{aligned}
\]


For example, for n = 2, 


\[
\begin{aligned}
y_1'(t) &= \cos t + y_1(t) y_2(t) = f_1(t, y_1, y_2), \\
y_2'(t) &= y_2(t) - y_1(t) = f_2(t, y_1, y_2).
\end{aligned}
\]


However, for a general function $f$, problem (7.1) need not have a solution on
the interval $[a,b]$. For example, in $\mathbb{R}^1$, consider

\[ y'(t) = [y(t)]^{3/2}, \quad 0 \le t \le 3, \quad y(0) = 1. \]

The function $y(t) = (1 - t/2)^{-2}$ is a solution to this IVP on the interval $[0,b]$,
$0 < b < 2$; this solution blows up as $t \to 2$. In fact, a solution to this IVP
does not exist over the entire interval $[0,3]$.

Additionally, even if a solution to the IVP exists over all of $[a,b]$, the solution
may not be unique. As an example of this phenomenon, consider the IVP

\[ y'(t) = [y(t)]^{1/2}, \quad 0 \le t \le 1, \quad y(0) = 0. \]

Clearly, $y(t) = 0$, $0 \le t \le 1$ is a solution. Also, it is easily shown that

\[ y(t) = \begin{cases} 0, & 0 \le t \le 1/2, \\ \tfrac{1}{4}(t - 1/2)^2, & 1/2 \le t \le 1 \end{cases} \]
is also a solution. Hence, solutions to this IVP are not unique on $[0,1]$.
Problem (7.1) has a unique solution in some interval$^1$ about $t = a$ if
$f_1, f_2, \ldots, f_n$ are continuous and possess continuous first partial derivatives
as stated in the following theorem. Details and a proof can be found in many
books on differential equations, such as [94].


THEOREM 7.1
(local existence and uniqueness) If there exists a positive number $\gamma$ such that
$f_1, f_2, \ldots, f_n$ are continuous and possess continuous first partial derivatives
with respect to the components $y_1, y_2, \ldots, y_n$ of $y$, for
\[ |t - a| \le \gamma, \quad |y_1 - y_{0,1}| \le \gamma, \quad |y_2 - y_{0,2}| \le \gamma, \quad \ldots, \quad |y_n - y_{0,n}| \le \gamma, \tag{7.2} \]

then there exists an $\eta > 0$ such that the system (7.1) has a unique solution
for $|t - a| \le \eta$.


Nonlocal existence is guaranteed by the following theorem. A classic refer- 
ence for Theorem 7.2 is [18]. 


THEOREM 7.2
Let $f$ be a continuous vector-valued function$^2$ defined on
\[ S = \{(t,y) \mid t \in \mathbb{R},\ y \in \mathbb{R}^n,\ |t - a| \le \gamma,\ \|y\| < \infty\}, \]

and let $f$ satisfy a Lipschitz condition in the $y$ variable over $S$. Then
$y' = f(t,y)$, $y(a) = y_0$ has a unique solution for $|t - a| \le \gamma$.


Analogously to Definition 2.2 (page 40), we have 


DEFINITION 7.1 $f$ satisfies a Lipschitz condition with respect to $y$
provided

\[ \|f(t,y) - f(t,\tilde{y})\| \le L\,\|y - \tilde{y}\|, \tag{7.3} \]

where $L$ is a constant (Lipschitz constant) independent of $y$, $\tilde{y}$, and $t$, and $\|\cdot\|$
is some norm in $\mathbb{R}^n$.

1 not necessarily a prespecified interval $[a,b]$
2 That is, $f(t,y) = (f_1(t,y), \ldots, f_n(t,y))^T$.

REMARK 7.1 Note that if $f$ is Lipschitz continuous with respect to one
norm in $\mathbb{R}^n$, then, by equivalence of vector norms, it is Lipschitz continuous
with respect to any other norm in $\mathbb{R}^n$ but with a different Lipschitz constant.
In addition, if each $\partial f_i/\partial y_j$, $1 \le i \le n$, $1 \le j \le n$ is continuous and bounded
on $S$, then application of the mean value theorem in componentwise form
yields

\[ f_i(t,y) - f_i(t,\tilde{y}) = \sum_{j=1}^{n} \frac{\partial f_i}{\partial y_j}(t, c_i)\,(y_j - \tilde{y}_j) \]

for some point $c_i \in \mathbb{R}^n$ on the line segment in $\mathbb{R}^n$ between $y$ and $\tilde{y}$. (This
follows from the multivariate mean value theorem, stated as Theorem 8.1 on
page 441 and proven in Exercise 1 on page 482 in Section 8.1). By letting

\[ L = \max_{1 \le i \le n}\ \sup_{(t,y) \in S}\ \sum_{j=1}^{n} \left|\frac{\partial f_i(t,y)}{\partial y_j}\right|, \]

the multivariate mean value theorem gives condition (7.3) for $\|\cdot\| = \|\cdot\|_\infty$.


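REMARK 7.2 Traditionally,$^3$ IVP’s for higher order differential equations
are not considered separately from first-order equations. By a change
of variables, higher order problems can be reduced to a system of the form of
(7.1). For example, consider the scalar IVP for the $m$-th-order scalar
differential equation:

\[
\begin{aligned}
y^{(m)}(t) &= g(t, y(t), y'(t), \ldots, y^{(m-1)}(t)), \quad a \le t \le b, \\
y(a) &= u_0, \quad y'(a) = u_1, \quad \ldots, \quad y^{(m-1)}(a) = u_{m-1}.
\end{aligned}
\tag{7.4}
\]

We can reduce this high order IVP to a first-order system of the form (7.1)
by defining $x : [a,b] \to \mathbb{R}^m$ componentwise by

\[ x(t) = [x_1(t), x_2(t), \ldots, x_m(t)]^T = [y(t), y'(t), \ldots, y^{(m-1)}(t)]^T. \]

Then

\[
\begin{aligned}
x_1'(t) &= x_2(t), \\
x_2'(t) &= x_3(t), \\
&\ \,\vdots \\
x_{m-1}'(t) &= x_m(t), \\
x_m'(t) &= g(t, x_1, x_2, \ldots, x_m),
\end{aligned}
\qquad\text{with}\qquad
x(a) = [u_0, u_1, \ldots, u_{m-1}]^T.
\tag{7.5}
\]

3 Recently, there has been some discussion concerning efficient methods that do consider
higher-order problems separately.

That is, in this case $f(t,x)$ is defined by:

\[ f(t,x) = [x_2, x_3, \ldots, x_m, g(t, x_1, x_2, \ldots, x_m)]^T. \]

Example 7.1

Consider
\[ y''(t) = y'(t)\cos(y(t)) + e^{-t}, \qquad y(0) = 1, \qquad y'(0) = 2. \]
Letting $x_1 = y$ and $x_2 = y'$, this becomes the first-order system
\[ x_1'(t) = x_2(t), \qquad x_2'(t) = x_2(t)\cos(x_1(t)) + e^{-t}, \qquad x_1(0) = 1, \quad x_2(0) = 2. \]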

In this chapter, we are concerned with numerical solution of (7.1), i.e.,
we shall seek approximations to $y(t)$, the solution of (7.1), at a discrete set
of points $t_0, t_1, \ldots, t_N \in [a,b]$, where $N$ is a positive integer. Although
in practice the $t_k$’s need not be equally spaced,$^4$ we will assume here that
they are to simplify the presentation of the theoretical analysis. With that
assumption, we define the step size or step or mesh length to be

\[ h = \frac{b - a}{N}. \]

The nodes of the numerical scheme will be defined to be the equidistant points
$t_k = a + kh$, $0 \le k \le N$, i.e., $t_0 = a$, $t_1 = a + h$, $\ldots$, $t_N = b$ and $t_{k+1} - t_k = h$.
A numerical scheme will produce vectors $y_0, y_1, \ldots, y_N$, which will
approximate the solution $y(t)$ at $t = t_0 = a$, $t = t_1$, $t = t_2$, $\ldots$, $t = t_N = b$,
respectively, i.e., $y_k \approx y(t_k)$. (Henceforth, in this chapter, $y_k$ will denote an
element of a sequence of vectors in $\mathbb{R}^n$, as opposed to the $k$-th component of
a vector $y$. Similarly, $f_k$ will be a vector in $\mathbb{R}^n$ that denotes the $k$-th element
of a sequence of values of $f$, as opposed to the value of the $k$-th component
of $f$.)

4 Modern software for differential equations, besides using methods in this chapter, employs
adaptive techniques to adjust the distance between points, to achieve a required accuracy
without excessive work.


7.2 Euler’s method 


The simplest method we consider is Euler’s method (also called the Cauchy
polygon method), given by the following iteration scheme.

\[
\begin{aligned}
y_0 &= y(a), \\
y_{k+1} &= y_k + h f(t_k, y_k), \quad 0 \le k \le N-1.
\end{aligned}
\tag{7.6}
\]

There are many ways of deriving (7.6). For example, approximating $y'(t)$
by the forward difference formula, we obtain

\[ \frac{y(t_{k+1}) - y(t_k)}{h} \approx y'(t_k) = f(t_k, y(t_k)). \]

Hence, $y(t_{k+1}) \approx y(t_k) + h f(t_k, y(t_k))$, which immediately suggests (7.6).


Example 7.2
Consider the scalar, i.e. $n = 1$, problem:

\[ y'(t) = t + y, \qquad y(1) = 2. \]

(The exact solution is $y(t) = -t - 1 + 4e^{-1}e^{t}$.) Euler’s method has the form

\[ y_{k+1} = y_k + h f(t_k, y_k) = y_k + h(t_k + y_k), \qquad y_0 = 2, \quad t_0 = 1. \]

Applying Euler’s method, we obtain the following: the error for $h = 0.1$ at
$t = 1.1$ is about 0.02 and the error for $h = 0.05$ at $t = 1.1$ is about 0.01.
If $h$ is cut in half, the error is cut in half, suggesting
that the error is proportional to $h$.
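
The two error values quoted for $y' = t + y$, $y(1) = 2$ can be checked with a few lines of Python. This is a minimal scalar sketch; the function names `euler` and `exact` are illustrative.

```python
import math


def euler(f, t0, y0, h, steps):
    """Take the given number of Euler steps of size h starting from (t0, y0)."""
    t, y = t0, y0
    for _ in range(steps):
        y = y + h * f(t, y)   # y_{k+1} = y_k + h f(t_k, y_k)
        t = t + h
    return y


def exact(t):
    return -t - 1.0 + 4.0 * math.exp(t - 1.0)


def f(t, y):
    return t + y


err_h = abs(euler(f, 1.0, 2.0, 0.1, 1) - exact(1.1))      # one step of h = 0.1
err_half = abs(euler(f, 1.0, 2.0, 0.05, 2) - exact(1.1))  # two steps of h = 0.05
print(err_h, err_half, err_h / err_half)
```

The ratio of the two errors is close to 2, consistent with an error proportional to $h$.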


We will now study the convergence of the sequence $\{y_k\}_{0 \le k \le N} \subset \mathbb{R}^n$ to the
solution $y(t)$, $a \le t \le b$, where $y_k \approx y(t_k) = y(a + kh) = y(a + k(b-a)/N)$.
We require the following lemma:


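LEMMA 7.1
Suppose that there exist positive constants $\delta$ and $K^*$ such that the members
of the sequence $d_0, d_1, \ldots$ satisfy

\[ d_{n+1} \le d_n(1 + \delta) + K^*, \quad n = 0, 1, 2, \ldots \tag{7.7} \]
Then,

\[ d_n \le d_0 e^{n\delta} + K^*\,\frac{e^{n\delta} - 1}{\delta}, \quad n = 0, 1, 2, \ldots \tag{7.8} \]

PROOF By recursively applying (7.7),
\[
\begin{aligned}
d_n &\le (1+\delta)^n d_0 + K^*\left(1 + (1+\delta) + (1+\delta)^2 + \cdots + (1+\delta)^{n-1}\right) \\
&= (1+\delta)^n d_0 + K^*\,\frac{(1+\delta)^n - 1}{\delta} \\
&\le e^{n\delta} d_0 + K^*\,\frac{e^{n\delta} - 1}{\delta}
\end{aligned}
\]
(after noting that $1 + \delta \le e^{\delta} = 1 + \delta + \frac{\delta^2}{2} + \cdots$).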


An error estimate, i.e., a bound on $y_k - y(t_k)$, will now be obtained in the 
norm $\|\cdot\|$ in which the Lipschitz condition (7.3) holds. For this norm, we 
suppose that 

$$c_2\|x\|_\infty \le \|x\| \le c_1\|x\|_\infty, \qquad \|x\|_\infty = \max_{1 \le i \le n} |x_i| \tag{7.9}$$

for some $c_1$ and $c_2$ independent of $x$, since any two norms on $\mathbb{R}^n$ are equivalent. 
We have the following result: 


THEOREM 7.3 
Suppose that $f(t, y)$ satisfies the conditions of Theorem 7.2, so that there is a 
unique solution $y(t)$ of (7.1) for $a \le t \le b$. Let $L$ be the Lipschitz constant 
defined in (7.3), and let $c_1 > 0$ be the constant defined in (7.9). Suppose that 
$y''(t)$ exists and is continuous for $a \le t \le b$ and that 

$$\max_{a \le t \le b} \|y''(t)\|_\infty = M. \tag{7.10}$$

Let $\{y_k\}_{k=0}^{N}$ be the discrete solution generated by the Euler method (7.6). Then 
we have the following error estimate. 

$$\max_{0 \le k \le N} \|y_k - y(t_k)\| \le h\,\frac{c_1 M}{2L}\left[e^{L(b-a)} - 1\right]. \tag{7.11}$$

(Note that the error is $O(h)$.) 


PROOF We define the discretization error or global truncation error at 
step $k$ by 

$$e_k = y_k - y(t_k), \qquad 0 \le k \le N. \tag{7.12}$$

If we assume the initial condition $y_0 = y(t_0)$ is exact, then $e_0 = 0$. Then, 
Euler’s method gives 

$$y_{k+1} = y_k + h f(t_k, y_k), \qquad 0 \le k \le N-1. \tag{7.13}$$

Consider now the solution $y(t) = [y_1(t), y_2(t), \ldots, y_n(t)]^T$. (Here, the 
subscript “i” denotes a vector component, rather than an iterate.) By Taylor’s 
theorem, 

$$y_i(t_{k+1}) = y_i(t_k) + h y_i'(t_k) + \frac{h^2}{2} y_i''(\xi_i^k) \tag{7.14}$$

for each $i$, $1 \le i \le n$, where $\xi_i^k$ is some point in the interval $[t_k, t_{k+1}]$. Denoting 
by $v_k$ the vector with components $v_i^k = y_i''(\xi_i^k)$, we write (7.14) as 

$$y(t_{k+1}) = y(t_k) + h f(t_k, y(t_k)) + \frac{h^2}{2} v_k. \tag{7.15}$$

Note by (7.9) and (7.10) that 

$$\|v_k\| \le c_1\|v_k\|_\infty \le c_1 M \tag{7.16}$$

for $k = 0, 1, \ldots, N$. Subtracting (7.15) from (7.13) and using definition (7.12), 
we have 

$$e_{k+1} = e_k + h\left[f(t_k, y_k) - f(t_k, y(t_k))\right] - \frac{h^2}{2} v_k. \tag{7.17}$$

Taking the norm of each side and using (7.16) gives 

$$\|e_{k+1}\| \le \|e_k\| + h\|f(t_k, y_k) - f(t_k, y(t_k))\| + \frac{h^2}{2} c_1 M. \tag{7.18}$$

Using the Lipschitz condition (7.3), we obtain 

$$\|e_{k+1}\| \le (1 + hL)\|e_k\| + \frac{h^2}{2} c_1 M, \qquad 0 \le k \le N-1, \tag{7.19}$$


with $\|e_0\| = 0$. Hence, using Lemma 7.1, by setting $d_k = \|e_k\|$, $\delta = hL$, 
$K^* = h^2 c_1 M/2$, and using $kh \le b - a$ for $0 \le k \le N$, we obtain 

$$\|e_k\| \le h\,\frac{c_1 M}{2L}\left[e^{L(b-a)} - 1\right]. \tag{7.20}$$
∎ 

REMARK 7.3 Note that the right side of (7.20) is of the form $ch$, where 
$c$ is a constant independent of $h$. Thus, as $h \to 0$, $y_k \to y(t_k)$, i.e., Euler’s 
method is convergent. It can be shown that the error estimate above is the 
best in the sense that the error is not of the form, for example, $ch^2$. In 
practice, the actual errors $\|e_k\|$ of Euler’s method are usually smaller than 
that predicted by (7.20), i.e., the constant $\frac{c_1 M}{2L}\left[e^{L(b-a)} - 1\right]$ is pessimistic. 
However, the linear behavior of the error is apparent, i.e., if the step size $h$ is 
cut in half, then the error is generally reduced by about a factor of 2. 


REMARK 7.4 We have assumed so far that the calculations in Euler’s 
method have been performed exactly. Consider now the effects of rounding 
errors. First, assume that 

$$\tilde y_0 = y_0 + e_0, \tag{7.21}$$

where we think of $e_0 \in \mathbb{R}^n$ as being “very small.” (Note that $e_0$ represents 
error in representation of the initial condition in the machine.) Second, assume 
that the rounding error in the calculation of $f$ is 

$$\tilde f(t_k, \tilde y_k) = f(t_k, \tilde y_k) + \epsilon_k, \quad \text{where } \epsilon_k \text{ represents roundoff error.} \tag{7.22}$$

Finally, assume rounding errors occur when we multiply $f$ by $h$ and add to 
$\tilde y_k$, i.e., 

$$\tilde y_{k+1} = \tilde y_k + h \tilde f(t_k, \tilde y_k) + \rho_k, \quad \text{where } \rho_k \text{ represents roundoff error.} \tag{7.23}$$

If we assume that, for all $k$, 

$$\|\epsilon_k\| \le \epsilon \quad \text{and} \quad \|\rho_k\| \le \rho,$$

then it can be shown® that 


maxo<k<n lx — y(tk)I| < cilleolleXO- 


a 7.24 
ter (SEP) ter gy. 


REMARK 7.5 If $\|e_0\| = \epsilon = \rho = 0$, then (7.24) reduces to (7.11). Also, 
note that $e_0 = 0$ if $y_0$ is exactly representable in the machine. 


What is interesting about the bound (7.24) is that the rounding error be- 
comes unbounded as h — 0, i.e., as the number of intervals, and hence the 
number of calculations, goes to infinity. Indeed if we plot the total error, we 
obtain the graph of Figure 7.1. Thus, if we wish very accurate calculations, 


⁵In a manner very similar to the proof of Theorem 7.3; you will do this in Exercise 1 on page 431. 

[Figure 7.1: plot of error against step size, with one curve labeled “rounding error” and one labeled “discretization error” (behaving like $ch$).] 

FIGURE 7.1: Illustration of rounding plus discretization error (Remark 7.5). 


then we may have to use extended precision® to reduce the constants €, p, and 
6. Of course, this increases the cost of the method. In subsequent sections, we 
shall seek more accurate methods, i.e., with error bounds of the form $ch^2$, $ch^3$, 
$ch^4$, etc. However, the general picture of discretization error versus rounding 
error persists in these methods. 


7.3  Single-Step Methods: 
Taylor Series and Runge-Kutta 


The objective of this section is to derive higher order methods, i.e., schemes 
whose errors are $O(h^k)$, $k > 1$. In particular, we shall consider explicit Taylor 
series and Runge-Kutta methods, which form a basic class of explicit single-step 
methods for numerical solution of (7.1). 

An explicit single-step (or one-step) method for numerical solution 
of (7.1) is a method of the form 

$$
\begin{aligned}
y_0 &= y(t_0),\\
y_{k+1} &= y_k + h\Phi(t_k, y_k, h), \qquad 0 \le k \le N-1,
\end{aligned}
\tag{7.25}
$$


where $\Phi$ is a given $\mathbb{R}^n$-valued function defined on $[a,b] \times \mathbb{R}^n \times [0, h_0]$, for some 
constant $h_0 > 0$. 

⁶At the time of the writing of this work, “standard” precisions correspond to IEEE 754 
single precision and IEEE 754 double precision, as given in Table 1.1 on page 19. Extended 
precision corresponds to more bits in the mantissa than one of these two “standard” 
precisions. For example, IEEE double precision uses 64 binary digits (bits) to represent the 
number, while IEEE quadruple precision, considered an “extended precision,” uses 128 bits 
total. Computer chips are typically designed so that single and double precision 
computations are much faster than extended precision. 

The first example of a one-step method is Euler’s method (7.6), for which 
$\Phi(t_k, y_k, h) = f(t_k, y_k)$. Let’s derive another one-step method. 

For simplicity, consider $n = 1$. If $y(t)$ is the exact solution of (7.1), then 

$$y(t_{k+1}) - y(t_k) = \int_{t_k}^{t_{k+1}} f(t, y(t))\,dt, \qquad 0 \le k \le N-1. \tag{7.26}$$

Approximating the integral on the right side by the midpoint rule, we obtain 

$$\int_{t_k}^{t_{k+1}} f(t, y(t))\,dt \approx h f\left(t_k + \frac{h}{2},\; y\left(t_k + \frac{h}{2}\right)\right). \tag{7.27}$$

Now, by Taylor’s Theorem, 

$$y\left(t_k + \frac{h}{2}\right) \approx y(t_k) + \frac{h}{2} y'(t_k) = y(t_k) + \frac{h}{2} f(t_k, y(t_k)). \tag{7.28}$$

By (7.26), (7.27), and (7.28), it is seen that $y(t)$ approximately satisfies 

$$y(t_{k+1}) \approx y(t_k) + h f\left(t_k + \frac{h}{2},\; K_1\right), \qquad 0 \le k \le N-1, \tag{7.29}$$

with $K_1 = y(t_k) + \frac{h}{2} f(t_k, y(t_k))$, 
which suggests the following numerical method, known as the midpoint method 
for solution of (7.1). We seek $y_j$, $0 \le j \le N$, such that 

$$
\begin{aligned}
y_0 &= y(t_0),\\
y_{j+1} &= y_j + h f\left(t_j + \frac{h}{2},\; K_{1,j}\right), \qquad j = 0, 1, 2, \ldots, N-1,\\
K_{1,j} &= y_j + \frac{h}{2} f(t_j, y_j).
\end{aligned}
\tag{7.30}
$$

We can write (7.30) in the form: 

$$
\begin{aligned}
y_0 &= y(t_0),\\
y_{j+1} &= y_j + h\Phi(t_j, y_j, h),
\end{aligned}
\tag{7.31}
$$

where 

$$\Phi(t_j, y_j, h) = f\left(t_j + \frac{h}{2},\; y_j + \frac{h}{2} f(t_j, y_j)\right).$$

Hence, the midpoint method is of the form (7.25). In addition, it is a 
Runge-Kutta method as defined later. Before continuing, some definitions are 
introduced. 

DEFINITION 7.2 We say that a single-step method (7.25) has order of 
accuracy $p$, provided $p$ is the largest integer for which 

$$y(t + h) - [y(t) + h\Phi(t, y, h)] = O(h^{p+1}), \tag{7.32}$$

where $y(t)$ is the exact solution of IVP (7.1). 

DEFINITION 7.3 A single-step method (7.25) is consistent with the 
differential equation $y'(t) = f(t, y)$ if $\Phi(t, y, 0) = f(t, y)$. 


We will see later that these definitions are useful. Let us investigate the 
order and consistency of Euler’s method and the midpoint method. 


7.3.1 Order and Consistency of Euler’s Method 
Euler’s method is consistent, since, clearly, $\Phi(t, y, 0) = f(t, y)$. Now let $y(t)$ 
be the solution of (7.1). Then 

$$y(t + h) - [y(t) + h\Phi(t, y, h)] = y(t + h) - y(t) - h f(t, y(t)). \tag{7.33}$$

Applying Taylor’s Theorem, for $1 \le i \le n$, 

$$y_i(t + h) = y_i(t) + h y_i'(t) + \frac{h^2}{2} y_i''(\xi_i), \qquad \xi_i \in [t, t+h], \tag{7.34}$$

where, here, $y_i$ is the i-th component of $y$. Suppose that $y''(t)$ is continuous 
for $a \le t \le b$ and let 

$$\max_{a \le t \le b} \|y''(t)\|_\infty = M. \tag{7.35}$$

By (7.33), (7.34), and using $y'(t) = f(t, y(t))$, we obtain 

$$\|y(t + h) - (y(t) + h\Phi(t, y, h))\|_\infty \le \frac{h^2}{2} M. \tag{7.36}$$

Thus, by Definition 7.2, Euler’s method has order of accuracy 1. 


7.3.2 Order and Consistency of the Midpoint Method 


For simplicity, let’s consider the scalar case $n = 1$. The midpoint method 
is consistent since, clearly, $\Phi(t, y, 0) = f(t, y)$. 
For notational convenience, define 

$$d(y(t), h) = y(t + h) - y(t) - h\Phi(t, y(t), h), \tag{7.37}$$

where $y(t)$ solves (7.1) for $n = 1$. The quantity $d(y(t), h)$ is often called the 
local truncation error, and is a measure of the amount that the exact solution 
fails to satisfy the difference equation (approximate method) for one time 
step. However, notice that in some texts $d(y(t), h)/h$ is defined to be the local 
truncation error. 
Recalling Taylor’s Theorem in two variables,⁷ we obtain 

$$
\begin{aligned}
\Phi(t, y(t), h) &= f\left(t + \frac{h}{2},\; y + \frac{h}{2} f(t, y)\right)\\
&= f(t, y) + \frac{h}{2}\frac{\partial f(t, y)}{\partial t} + \frac{h}{2} f(t, y)\frac{\partial f(t, y)}{\partial y} + O(h^2),
\end{aligned}
$$

provided that $f$, $f_t$, $f_y$, $f_{tt}$, $f_{yy}$, and $f_{ty}$ are continuous and bounded for 
$a \le t \le b$. Hence, 

$$
\begin{aligned}
\Phi(t, y(t), h) &= f(t, y) + \frac{h}{2}\left[f_t + f f_y\right](t, y) + O(h^2)\\
&= f(t, y) + \frac{h}{2}\frac{d}{dt} f(t, y) + O(h^2)\\
&= y'(t) + \frac{h}{2} y''(t) + O(h^2).
\end{aligned}
$$

Substituting this result into (7.37) and assuming that $y'''(t)$ is bounded and 
continuous for $a \le t \le b$, we have 

$$
\begin{aligned}
d(y(t), h) &= y(t) + h y'(t) + \frac{h^2}{2} y''(t) + O(h^3)\\
&\quad - y(t) - h y'(t) - \frac{h^2}{2} y''(t) + O(h^3) = O(h^3).
\end{aligned}
\tag{7.38}
$$

Hence, if $y'''(t)$ is continuous for $a \le t \le b$, we see that the midpoint method 
has order $p = 2$. 


REMARK 7.6 For a system (n > 1), it can be verified by Taylor series 
expansions of each component that the midpoint method has order 2. 
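
The orders derived in Sections 7.3.1 and 7.3.2 can be checked numerically by halving $h$ and comparing global errors. The sketch below is illustrative only: it assumes the test problem $y' = t + y$, $y(1) = 2$ from Example 7.2 integrated to $t = 2$, and the names `one_step_solve`, `phi_euler`, `phi_midpoint`, and `observed_order` are hypothetical helpers, not notation from the text.

```python
import math

def one_step_solve(phi, f, t0, y0, t_end, n_steps):
    """Integrate with the single-step scheme y_{k+1} = y_k + h*Phi(t_k, y_k, h)."""
    h = (t_end - t0) / n_steps
    y = y0
    for k in range(n_steps):
        t = t0 + k * h
        y = y + h * phi(f, t, y, h)
    return y

def phi_euler(f, t, y, h):
    """Increment function of Euler's method: Phi = f(t, y)."""
    return f(t, y)

def phi_midpoint(f, t, y, h):
    """Increment function of the midpoint method (7.31)."""
    return f(t + h / 2.0, y + (h / 2.0) * f(t, y))

def observed_order(phi, n_steps=20):
    """Estimate p from log2(error(h) / error(h/2)) at t = 2."""
    f = lambda t, y: t + y
    exact = -2.0 - 1.0 + 4.0 * math.exp(1.0)   # y(2) = -t - 1 + 4 e^{t-1} at t = 2
    e_h = abs(one_step_solve(phi, f, 1.0, 2.0, 2.0, n_steps) - exact)
    e_half = abs(one_step_solve(phi, f, 1.0, 2.0, 2.0, 2 * n_steps) - exact)
    return math.log(e_h / e_half, 2)

if __name__ == "__main__":
    print("Euler    observed order:", observed_order(phi_euler))
    print("Midpoint observed order:", observed_order(phi_midpoint))
```

The printed values should be close to 1 and 2, respectively; they are estimates from two step sizes, not exact orders.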


7.3.3 An Error Bound for Single-Step Methods 


We now derive an error bound for single-step methods (7.25). 


THEOREM 7.4 
Let $y(t)$ be the solution of (7.1) and let $\{y_k\}_{k=0}^{N}$ be the numerical solution 
generated by method (7.25). Furthermore, assume that 

(a) $\Phi(t, y, h)$ of (7.25) is a continuous function of its arguments; 

(b) there exists a positive constant $M$ such that 

$$\|\Phi(t, y, h) - \Phi(t, \tilde y, h)\| \le M\|y - \tilde y\| \tag{7.39}$$

for all $y, \tilde y \in \mathbb{R}^n$, $h \in [0, h_0]$, $t \in [a, b]$ (that is, $\Phi$ is Lipschitz in $y$ with 
Lipschitz constant⁸ $M$); 

(c) the method (7.25) is consistent and has order $p > 0$; specifically let 

$$d(t, y(t), h) = y(t + h) - y(t) - h\Phi(t, y(t), h),$$

and suppose there is a constant $D$, independent of $h$, such that 

$$\|d(t, y(t), h)\| \le D h^{p+1} \tag{7.40}$$

for $t \in [a, b]$, $h \in [0, h_0]$, where $D$ may depend on $y$ and $f(t, y)$. 

We then have the following error bound: 

$$\max_{0 \le k \le N} \|y_k - y(t_k)\| \le c h^p \tag{7.41}$$

for $0 \le h \le h_0$, where $c = D\left(e^{M(b-a)} - 1\right)/M$. 

⁷We review multivariate Taylor polynomials on page 441. 


PROOF By (7.25), we have 

$$y_{k+1} = y_k + h\Phi(t_k, y_k, h). \tag{7.42}$$

We also obtain 

$$y(t_{k+1}) = y(t_k) + h\Phi(t_k, y(t_k), h) + d(t_k, y(t_k), h) \tag{7.43}$$

from the definition of $d(t, y(t), h)$. Hence, letting $e_k = y_k - y(t_k)$, $0 \le k \le N-1$, 
(7.42) and (7.43) imply 

$$e_{k+1} = e_k + h\left[\Phi(t_k, y_k, h) - \Phi(t_k, y(t_k), h)\right] - d(t_k, y(t_k), h). \tag{7.44}$$

This, with (7.39) and (7.40), in turn implies 

$$\|e_{k+1}\| \le \|e_k\|(1 + hM) + D h^{p+1}. \tag{7.45}$$

We now apply Lemma 7.1 as follows: 

$$d_{n+1} \le d_n(1 + \delta) + K^*$$

implies 

$$d_n \le d_0 e^{n\delta} + \frac{K^*}{\delta}\left(e^{n\delta} - 1\right) \quad \text{with } d_k = \|e_k\|, \; \delta = hM, \; K^* = D h^{p+1}.$$

Now, noting $e_0 = y_0 - y(t_0) = 0$, we obtain 

$$\|e_k\| \le \frac{D}{M}\left(e^{M(b-a)} - 1\right) h^p \tag{7.46}$$

for $0 \le k \le N$, which yields (7.41). ∎ 

⁸Here, the norm can be any norm, although the exact value of $M$ depends on the norm. 

REMARK 7.7 Note that $d(t, y(t), h) = O(h^{p+1})$ and $\|e_k\| = O(h^p)$, i.e., 
the order of the “local discrepancy” or “local truncation error” of the method 
is one order higher than the error bound. 


REMARK 7.8 To achieve bound (7.41), sufficient smoothness assump- 
tions have been imposed on y(t), the solution of (7.1). If y(t) is not sufficiently 
smooth, then the method will not yield the high accuracy predicted by (7.41). 
(Smoothness is used to show (7.40); for example, see the derivation of (7.41) 
for Euler’s Method on page 391 and the derivation of (7.41) for the midpoint 
method on page 391.) 


REMARK 7.9 As an easy example, let us verify that conditions (a), (b), 
and (c) are fulfilled for the midpoint method (7.31), for which 

$$\Phi(t_j, y_j, h) = f\left(t_j + \frac{h}{2},\; y_j + \frac{h}{2} f(t_j, y_j)\right).$$

(a) Assuming that the conditions of the existence-uniqueness Theorem 7.2 hold, 
condition (a) follows from these conditions. 

(b) Furthermore, we have 

$$
\begin{aligned}
\|\Phi(t, y, h) - \Phi(t, \tilde y, h)\| &= \left\|f\left(t + \frac{h}{2},\; y + \frac{h}{2} f(t, y)\right) - f\left(t + \frac{h}{2},\; \tilde y + \frac{h}{2} f(t, \tilde y)\right)\right\|\\
&\le L\left\|y + \frac{h}{2} f(t, y) - \tilde y - \frac{h}{2} f(t, \tilde y)\right\|\\
&\le L\|y - \tilde y\| + \frac{h}{2} L^2\|y - \tilde y\| \le M\|y - \tilde y\|,
\end{aligned}
$$

where $M = L + L^2 h_0/2$. 

(c) Finally, we previously verified that (7.40) holds with $p = 2$; hence, the 
error in the midpoint method is $O(h^2)$. 

7.3.4 Higher-Order Methods 


We now construct some single-step methods of high order. Two types of 
single-step methods are Taylor series methods and Runge-Kutta methods. 


7.3.4.1 Taylor Series Methods 


Consider first Taylor series methods, which form one class of single-step 
methods. These methods are easy to derive. In the past, these methods were 
seldom used in practice since they required evaluations of high-order deriva- 
tives. However, with efficient implementations of automatic differentiation,® 
these methods are increasingly solving important real-world problems. For 
example, very high-order Taylor methods (of order 30 or higher) are used, 
with the aid of automatic differentiation, in the “COSY Infinity” package, 
which is used world-wide to model atomic particle accelerator beams. (See, 
for example, [13].) 

If $y(t)$, the solution of (7.1), is sufficiently smooth, we see that 

$$y(t_{k+1}) = y(t_k) + h y'(t_k) + \frac{h^2}{2} y''(t_k) + \cdots + \frac{h^p}{p!} y^{(p)}(t_k) + O(h^{p+1}) \tag{7.47}$$

where, using (7.1), these derivatives can be computed explicitly with the 
multivariate chain rule of the usual calculus. Thus, (7.47) leads to the following 
numerical scheme: 

$$
\begin{aligned}
y_0 &= y(a),\\
y_{k+1} &= y_k + h f(t_k, y_k) + \frac{h^2}{2}\frac{d}{dt} f(t_k, y_k) + \cdots + \frac{h^p}{p!}\frac{d^{p-1}}{dt^{p-1}} f(t_k, y_k)
\end{aligned}
\tag{7.48}
$$

for $k = 0, 1, 2, \ldots, N-1$. This is called a Taylor series method. (Note that 
Euler’s method is a Taylor series method of order $p = 1$.) 

By construction, the order of the method (7.48) is $p$. In weighing the 
practicality of this method, one should consider the structure of the problem 
itself, along with the ease (or lack thereof) of computing the derivatives. For 
example, with $n = 1$, we must compute 

$$
\begin{aligned}
\frac{d}{dt} f(t, y) &= \frac{\partial f}{\partial t} + f\frac{\partial f}{\partial y},\\
\frac{d^2}{dt^2} f(t, y) &= \frac{\partial^2 f}{\partial t^2} + 2f\frac{\partial^2 f}{\partial t\,\partial y} + f^2\frac{\partial^2 f}{\partial y^2} + \frac{\partial f}{\partial t}\frac{\partial f}{\partial y} + f\left(\frac{\partial f}{\partial y}\right)^2,
\end{aligned}
$$

etc. 

If $f$ is mildly complicated, then it is impractical to compute these formulas 
by hand;¹⁰ also, observe that, for $n > 1$, the number of terms can become 
large, although many may be zero; thus, an implementation of automatic 
differentiation should take advantage of the structure in $f$. 

⁹These implementations can be very sophisticated. 
¹⁰But this does not rule out automatic differentiation. 


Example 7.3 
Consider 

$$y'(t) = f(t, y) = t + y, \qquad y(1) = 2,$$

which has exact solution $y(t) = -t - 1 + 4e^{-1}e^{t}$. The Taylor series method of 
order 2 for this example has 

$$f(t, y) = t + y \quad \text{and} \quad \frac{d}{dt} f(t, y) = 1 + t + y.$$

Therefore, 

$$
\begin{aligned}
y_{k+1} &= y_k + h f(t_k, y_k) + \frac{h^2}{2}\frac{d}{dt} f(t_k, y_k)\\
&= y_k + h(t_k + y_k) + \frac{h^2}{2}(1 + t_k + y_k).
\end{aligned}
$$

Letting $h = 0.1$, we obtain the following results: 

k | t_k | y_k (Euler) | y_k (T.S. order 2) | y(t_k) (Exact) 
1 | 1.1 | 2.3         | 2.32               | 2.3207 
2 | 1.2 | 2.64        | 2.6841             | 2.6856 
3 | 1.3 | 3.024       | 3.0969             | 3.0994 
4 | 1.4 | 3.4564      | 3.5636             | 3.5673 
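
The Taylor series column of the table can be regenerated directly from the update formula of Example 7.3. This is a minimal sketch for this specific right-hand side only; the derivative $\frac{d}{dt}f = 1 + t + y$ is hard-coded, and the function names `taylor2_step` and `taylor2` are illustrative.

```python
def taylor2_step(t, y, h):
    """One step of the order-2 Taylor method for y' = t + y."""
    f = t + y              # f(t, y)
    df = 1.0 + t + y       # d/dt f(t, y(t)) = f_t + f * f_y = 1 + (t + y)
    return y + h * f + 0.5 * h * h * df

def taylor2(t0, y0, h, n_steps):
    """Return [y_0, y_1, ..., y_{n_steps}] for the order-2 Taylor method."""
    y = y0
    values = [y]
    for k in range(n_steps):
        t = t0 + k * h
        y = taylor2_step(t, y, h)
        values.append(y)
    return values

if __name__ == "__main__":
    for k, y_k in enumerate(taylor2(1.0, 2.0, 0.1, 4)):
        print(k, round(1.0 + 0.1 * k, 1), round(y_k, 4))
```

For a general $f$, the hand-coded derivative would be replaced by symbolic or automatic differentiation, as the text notes.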


7.3.4.2 Runge-Kutta Methods 


We now consider single-step Runge-Kutta methods, whose associated 
function $\Phi(t, y, h)$ requires (possibly repeated) function evaluations of $f(t, y)$ but 
not its derivatives. (Two examples are Euler’s method and the midpoint method.) 
In general, single-step Runge-Kutta methods have the form: 

$$
\begin{aligned}
y_0 &= y(a),\\
y_{k+1} &= y_k + h\Phi(t_k, y_k, h),
\end{aligned}
\tag{7.49}
$$

where 

$$
\begin{aligned}
\Phi(t, y, h) &= \sum_{r=1}^{R} c_r K_r,\\
K_1 &= f(t, y),\\
K_r &= f\Bigl(t + a_r h,\; y + h\sum_{s=1}^{r-1} b_{rs} K_s\Bigr),
\end{aligned}
$$

and 

$$a_r = \sum_{s=1}^{r-1} b_{rs}, \qquad r = 2, 3, \ldots, R.$$

Such a method is called an R-stage Runge-Kutta method. Notice that 
Euler’s method is a one-stage Runge-Kutta method and the midpoint method 
is a two-stage Runge-Kutta method with $c_1 = 0$, $c_2 = 1$, $a_2 = \frac{1}{2}$, $b_{21} = \frac{1}{2}$, 
i.e., 

$$y_{k+1} = y_k + h f\left(t_k + \frac{h}{2},\; y_k + \frac{h}{2} f(t_k, y_k)\right).$$

We now consider in detail the general two-stage Runge-Kutta method, i.e., 

$$y_{k+1} = y_k + h c_1 f(t_k, y_k) + h c_2 f(t_k + a_2 h,\; y_k + a_2 h f(t_k, y_k)). \tag{7.50}$$

Let’s see if there are other Runge-Kutta methods (2-stage) of order $p = 2$ 
besides the midpoint method. For simplicity, let $n = 1$. By (7.32), we need 
to consider 

$$
\begin{aligned}
&y(t + h) - \left[y(t) + h c_1 f(t, y) + h c_2 f(t + a_2 h,\; y + a_2 h f(t, y))\right]\\
&\quad = y(t + h) - y(t) - h c_1 f(t, y) - h c_2\left[f(t, y) + \frac{\partial f(t, y)}{\partial t} a_2 h + \frac{\partial f(t, y)}{\partial y} a_2 h f(t, y)\right] + O(h^3)\\
&\quad = y(t) + h f(t, y) + \frac{h^2}{2}\left[\frac{\partial f(t, y)}{\partial t} + \frac{\partial f(t, y)}{\partial y} f(t, y)\right] - y(t)\\
&\qquad - h(c_1 + c_2) f(t, y) - h^2 c_2 a_2\left[\frac{\partial f(t, y)}{\partial t} + \frac{\partial f(t, y)}{\partial y} f(t, y)\right] + O(h^3)\\
&\quad = O(h^3),
\end{aligned}
$$

if $c_1 + c_2 = 1$ and $c_2 a_2 = 1/2$. Letting $c_2 = \alpha$, then $c_1 = 1 - \alpha$, $a_2 = 1/(2\alpha)$, 
and we obtain a family of order $p = 2$ schemes, i.e., 

$$
\begin{aligned}
y_0 &= y(t_0),\\
y_{k+1} &= y_k + h\Phi(t_k, y_k, h),
\end{aligned}
\tag{7.51}
$$

where 

$$\Phi(t, y, h) = (1 - \alpha) f(t, y) + \alpha f\left(t + \frac{h}{2\alpha},\; y + \frac{h}{2\alpha} f(t, y)\right).$$

When $\alpha = 1$, this gives the midpoint method and when $\alpha = 1/2$, this gives 
what is called the improved or modified Euler method. 

An analysis analogous to the second-order case can be performed for 
methods of order $p = 3$ and order $p = 4$. The most well-known Runge-Kutta 
scheme (from elementary numerical analysis texts) is 4-th order; it has the 
form: 

$$
\begin{aligned}
y_0 &= y(t_0),\\
y_{k+1} &= y_k + \frac{h}{6}\left[K_1 + 2K_2 + 2K_3 + K_4\right],\\
K_1 &= f(t_k, y_k),\\
K_2 &= f\left(t_k + \frac{h}{2},\; y_k + \frac{h}{2} K_1\right),\\
K_3 &= f\left(t_k + \frac{h}{2},\; y_k + \frac{h}{2} K_2\right),\\
K_4 &= f(t_k + h,\; y_k + h K_3),
\end{aligned}
\tag{7.52}
$$

$$\Phi(t_k, y_k, h) = \frac{1}{6}\left[K_1 + 2K_2 + 2K_3 + K_4\right].$$

Notice that in single-step methods $y_{k+1} = y_k + h\Phi(t_k, y_k, h)$, the quantity $h\Phi(t_k, y_k, h)$ is 
an approximation to the “rise” in $y$ in going from $t_k$ to $t_k + h$. In the 
fourth-order Runge-Kutta method, $\Phi(t_k, y_k, h)$ is a weighted average of approximate 
“slopes” $K_1$, $K_2$, $K_3$, $K_4$ evaluated at $t_k$, $t_k + h/2$, $t_k + h/2$ and $t_k + h$, 
respectively. 


Example 7.4 
Consider $y'(t) = t + y$, $y(1) = 2$, with $h = 0.1$. We obtain 

k | t_k | Euler | Runge-Kutta order 2 (Modified Euler) | Runge-Kutta order 4 | y(t_k) (exact) 
0 | 1   | 2     | 2       | 2       | 2 
1 | 1.1 | 2.30  | 2.32    | 2.32068 | 2.32068 
2 | 1.2 | 2.64  | 2.6841  | 2.68561 | 2.68561 
3 | 1.3 | 3.024 | 3.09693 | 3.09943 | 3.09944 

□ 
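
The Runge-Kutta columns of Example 7.4 can be reproduced with the two-stage family (7.51) and the classical scheme (7.52). The following is a sketch for scalar problems; `rk2_family_step`, `rk4_step`, and `integrate` are illustrative names, and `alpha` corresponds to the parameter $\alpha$ in (7.51).

```python
def rk2_family_step(f, t, y, h, alpha):
    """One step of the two-stage family (7.51); alpha=1 midpoint, alpha=0.5 modified Euler."""
    k1 = f(t, y)
    k2 = f(t + h / (2.0 * alpha), y + h / (2.0 * alpha) * k1)
    return y + h * ((1.0 - alpha) * k1 + alpha * k2)

def rk4_step(f, t, y, h):
    """One step of the classical fourth-order Runge-Kutta method (7.52)."""
    k1 = f(t, y)
    k2 = f(t + h / 2.0, y + h / 2.0 * k1)
    k3 = f(t + h / 2.0, y + h / 2.0 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(step, f, t0, y0, h, n_steps):
    """Apply a single-step rule n_steps times; return [y_0, ..., y_{n_steps}]."""
    y = y0
    out = [y]
    for k in range(n_steps):
        y = step(f, t0 + k * h, y, h)
        out.append(y)
    return out

def modified_euler_step(f, t, y, h):
    """Member of the family (7.51) with alpha = 1/2."""
    return rk2_family_step(f, t, y, h, 0.5)

if __name__ == "__main__":
    f = lambda t, y: t + y
    rk2 = integrate(modified_euler_step, f, 1.0, 2.0, 0.1, 3)
    rk4 = integrate(rk4_step, f, 1.0, 2.0, 0.1, 3)
    for k in range(4):
        print(k, round(rk2[k], 5), round(rk4[k], 5))
```

Because $f$ is linear in $y$ here, the modified Euler and midpoint methods produce the same iterates; on nonlinear problems they generally differ.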


Higher-order Runge-Kutta methods are sometimes used, such as in the 
Runge-Kutta—Fehlberg method we introduce in section 7.4 below. 


7.3.4.3 Stability of Runge-Kutta Methods 


We now consider stability of Runge-Kutta methods. Suppose that at the 
k-th step, due to rounding errors, we actually don’t obtain $y_k$, but we obtain 
$z_k$ such that $y_k - z_k \ne 0$. The error at the k-th step will influence subsequent 
values, so that at the N-th and final step, we have $y_N - z_N \ne 0$, where $y_N$ is 
what would be obtained if we applied exact computations (without roundoff 
error) to the iteration equation (7.25), and $z_N$ is the actual computed result, 
assuming an error was made at the k-th step. 

DEFINITION 7.4 We say that the Runge-Kutta method is numerically 
stable if there is a constant $c$ independent of $h$ such that 

$$\|y_N - z_N\| \le c\|y_k - z_k\| \quad \text{for all } k \le N, \tag{7.53}$$

where $h = (b - a)/N$. 

Suppose that conditions (a) and (b) of Theorem 7.4 (our error bound 
theorem, on page 393) are satisfied. Then it is straightforward to show that the 
method is stable. Considering $j \ge k$, we have 

$$z_{j+1} = z_j + h\Phi(t_j, z_j, h). \tag{7.54}$$

Hence, 

$$z_{j+1} - y_{j+1} = z_j - y_j + h\left[\Phi(t_j, z_j, h) - \Phi(t_j, y_j, h)\right] \tag{7.55}$$

for $k \le j \le N-1$ and thus, 

$$\|z_{j+1} - y_{j+1}\| \le (1 + hM)\|z_j - y_j\|, \qquad k \le j \le N-1,$$

from which we see that 

$$
\begin{aligned}
\|z_N - y_N\| &\le (1 + hM)^{N-k}\|z_k - y_k\|\\
&\le (1 + hM)^{N}\|z_k - y_k\| \le e^{NhM}\|z_k - y_k\| = e^{M(b-a)}\|z_k - y_k\|\\
&= c\|z_k - y_k\|.
\end{aligned}
$$

Hence, Runge-Kutta methods, under the stated conditions, are stable in the 
sense that (7.53) is satisfied. This implies that an error $\|z_k - y_k\|$ will not be 
magnified by more than a constant $c$ at final time $t_N$, i.e., “small errors” have 
“small effect.” 

The above definition of stability is not satisfactory if the constant $c$ is very 
large. Consider, for example, Euler’s method applied to the scalar equation 
$y' = \lambda y$, $\lambda$ = constant. Then Euler’s scheme gives $y_{j+1} = y_j(1 + \lambda h)$, $0 \le j \le N-1$. 
An error, say at $t = t_k$, will cause us to compute $z_{j+1} = z_j(1 + \lambda h)$ 
instead and hence $|z_{j+1} - y_{j+1}| = |1 + \lambda h|\,|z_j - y_j|$, $k \le j \le N-1$. Thus, the 
error will be magnified if $|1 + \lambda h| > 1$, will remain the same if $|1 + \lambda h| = 1$, 
and will be suppressed if $|1 + \lambda h| < 1$. Consider the problem 

$$y' = -1000y + 1000t^2 + 2t, \qquad 0 \le t \le 1, \qquad y(0) = 0,$$

whose exact solution is $y(t) = t^2$, $0 \le t \le 1$. We find for Euler’s method 
that $|z_{j+1} - y_{j+1}| = |1 - 1000h|\,|z_j - y_j|$, $0 \le j \le N-1$. The error will be 
suppressed if $|1 - 1000h| < 1$, i.e., $0 < h < 0.002$. Consider the following 
table: 
0.99999900 
0.99999990 
0.99999999 


For $h > 0.002$, small errors are violently magnified. For example, for $h = 0.01$, 
the errors are magnified by $|1 - \frac{1000}{100}| = 9$ at each time step. However, note 
that the error bound given by Theorem 7.3, that is, 

$$|y_N - y(1)| \le \frac{Mh}{2L}\left[e^{L(b-a)} - 1\right] = 10^{-5}\left(e^{1000} - 1\right)$$

($M = 2$, $L = 1000$, $a = 0$, $b = 1$, $h = 0.01$) is still valid, but the error bound is 
so large that it is practically meaningless. 
This discussion motivates a second concept of stability that will be impor- 
tant when we discuss stiff systems. 
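
The magnification factor $|1 - 1000h|$ described above can be observed by running Euler’s method twice on this problem, once from $y(0) = 0$ and once from a slightly perturbed initial value, and comparing the results. This is a sketch under that assumption; `euler_stiff` and `amplification` are hypothetical helper names, and the perturbation size `delta` is an arbitrary illustrative choice.

```python
def euler_stiff(h, n_steps, perturbation=0.0):
    """Euler's method for y' = -1000 y + 1000 t^2 + 2 t, started at y(0) = perturbation."""
    y = perturbation
    for k in range(n_steps):
        t = k * h
        y = y + h * (-1000.0 * y + 1000.0 * t * t + 2.0 * t)
    return y

def amplification(h, n_steps=10, delta=1e-8):
    """Growth of an initial error delta over n_steps Euler steps."""
    base = euler_stiff(h, n_steps)
    perturbed = euler_stiff(h, n_steps, perturbation=delta)
    return abs(perturbed - base) / delta

if __name__ == "__main__":
    # For h = 0.01 the theory predicts a factor |1 - 10|^10 = 9^10 after 10 steps;
    # for h = 0.0015 it predicts |1 - 1.5|^10 = 0.5^10.
    print("h = 0.01  :", amplification(0.01), " predicted:", 9.0 ** 10)
    print("h = 0.0015:", amplification(0.0015), " predicted:", 0.5 ** 10)
```

Because the problem is linear in $y$, the difference of the two runs obeys the error recursion exactly (up to rounding), which is why the observed and predicted factors should agree closely.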


DEFINITION 7.5 A numerical method for solution of (7.1) is called 
absolutely stable if, when applied to the scalar equation $y' = \lambda y$, $t \ge 0$, it yields 
values $\{y_j\}_{j \ge 0}$ with the property that $y_j \to 0$ as $j \to \infty$. The set of values $\lambda h$ 
for which a method is absolutely stable is called the set of absolute stability. 


Example 7.5 
(Absolute stability of Euler’s method and the midpoint method) 

1. Euler’s Method applied to $y' = \lambda y$ yields $y_{j+1} = y_j(1 + \lambda h)$, whence 
$y_j = y_0(1 + \lambda h)^j$. 
Clearly, assuming that $\lambda$ is real, $y_j \to 0$ as $j \to \infty$ if and only if $|1 + \lambda h| < 1$ 
or $-2 < \lambda h < 0$. Hence, the interval of absolute stability of Euler’s 
method is $(-2, 0)$. 
2. The Midpoint Method applied to $y' = \lambda y$ yields 

$$y_{j+1} = y_j + h\lambda\left(y_j + \frac{h}{2}\lambda y_j\right) = y_j\left(1 + \lambda h + \frac{\lambda^2 h^2}{2}\right).$$

Hence, $y_j \to 0$ as $j \to \infty$ if $|1 + \lambda h + \lambda^2 h^2/2| < 1$, which for $\lambda$ real leads 
to an interval of absolute stability $(-2, 0)$. (Other examples are given 
in the exercises.) 

In general, explicit Runge-Kutta methods require very small $h$ for 
accurate approximations of problems with large $|\lambda|$. (Notice that the linear 
case with $\lambda$ models the nonlinear case $y' = f(t, y)$ with Lipschitz constant 
$L = |\lambda|$.) These methods should not be used for such problems or their 
system analogs (stiff systems). We will consider later methods suitable for such 
problems. 
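
The two stability conditions of Example 7.5 can be checked numerically by evaluating the per-step growth factor at sample values of $z = \lambda h$. This is an illustrative sketch; the function names are hypothetical.

```python
def growth_euler(z):
    """Per-step growth factor |1 + z| of Euler's method for y' = lambda*y, z = lambda*h."""
    return abs(1.0 + z)

def growth_midpoint(z):
    """Per-step growth factor |1 + z + z^2/2| of the midpoint method."""
    return abs(1.0 + z + 0.5 * z * z)

def is_absolutely_stable(growth, z):
    """True if the iterates y_j decay to zero, i.e., the growth factor is below 1."""
    return growth(z) < 1.0

if __name__ == "__main__":
    for z in (-2.5, -1.9, -1.0, -0.1, 0.1):
        print(z, is_absolutely_stable(growth_euler, z),
              is_absolutely_stable(growth_midpoint, z))
```

Since `abs` also accepts complex arguments, the same functions can be used to probe complex values of $\lambda h$, although the example in the text only treats real $\lambda$.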


7.4 Error Control and the Runge—Kutta—Fehlberg Method 


Consider $n = 1$. Suppose that $y(t)$ rapidly increases (or decreases) at some 
value of $t$, e.g., $t = c$. (See Figure 7.2.) Large time steps may provide high 
accuracy in regions where $y$ does not vary rapidly with $t$ but in regions where 
$y$ varies rapidly, small $h$ may be essential to obtain the desired accuracy. 
However, the different regions may be unknown beforehand. 

FIGURE 7.2: Sudden increase in a solution. 

It is straightforward to derive single-step methods with variable time steps. 
For example, with $a = t_0 < t_1 < \cdots < t_{N-1} < t_N = b$, and $h_j = t_{j+1} - t_j$, 
$0 \le j \le N-1$, Euler’s scheme becomes 

$$
\begin{aligned}
y_0 &= y(t_0) = y(a),\\
y_{j+1} &= y_j + h_j f(t_j, y_j).
\end{aligned}
$$


However, we would like our method to “sense” when the time steps $h_j = t_{j+1} - t_j$ 
should be decreased depending on the behavior of the solution $y(t)$, 
which is not known beforehand. Therefore, we would like to estimate the error 
committed by the scheme and reduce the time step if the errors are too large. 
Consider the general Runge-Kutta scheme 

$$y_{j+1} = y_j + h_j\Phi(t_j, y_j, h_j).$$

The quantity $y(t_{j+1}) - y_{j+1}$ is called the global error at time $t_{j+1}$. 
Unfortunately, $y(t_{j+1}) - y_{j+1}$ is hard to estimate directly, so we focus on the local 
error. 
To define local error, let $y(t)$, $t_j \le t \le t_{j+1}$, be the solution of IVP (7.1) 
and consider 

$$
\begin{aligned}
u'(t) &= f(t, u(t)), \qquad t_j \le t \le t_{j+1},\\
u(t_j) &= y_j.
\end{aligned}
\tag{7.57}
$$

The local error at $t_{j+1}$ is defined by the quantity $u(t_{j+1}) - y_{j+1}$. The local 
error is thus a measure of the error committed by the numerical method in 
just a single step. Figure 7.3 illustrates this for $n = 1$. 

[Figure 7.3: the curves $y(t)$ and $u(t)$ on $[t_j, t_{j+1}]$, with the local error at $t_{j+1}$ marked as the gap between $u(t_{j+1})$ and $y_{j+1}$.] 

FIGURE 7.3: Local error in integrating an ODE. 

Since 

$$\text{global error} = y(t_{j+1}) - y_{j+1} = \left[y(t_{j+1}) - u(t_{j+1})\right] + \left[u(t_{j+1}) - y_{j+1}\right],$$

the global error is the sum of the local error and a quantity $y(t_{j+1}) - u(t_{j+1})$ 
which measures the “stability” of the ODE $y' = f(t, y)$, in the sense that 
“small” deviations at $t = t_j$ should produce “small” effects at $t = t_j + h_j$ for 
$h_j$ small, as $y(t)$ and $u(t)$ are solutions of the same ODE with different starting 
values. 

Thus, we concentrate on estimating local errors. First, recall that the 
single-step method of order $p$ has local truncation error 

$$d(t, y(t), h) = y(t + h) - y(t) - h\Phi(t, y(t), h) = O(h^{p+1}), \tag{7.58}$$

where $y(t)$ satisfies the ODE. 
In particular, $u(t)$ satisfies (by (7.57)) the ODE, so 

$$d(t_j, u(t_j), h_j) = u(t_{j+1}) - u(t_j) - h_j\Phi(t_j, u(t_j), h_j) = O(h_j^{p+1}). \tag{7.59}$$

In addition, $0 = y_{j+1} - y_j - h_j\Phi(t_j, y_j, h_j)$, so 

$$u(t_{j+1}) - y_{j+1} = O(h_j^{p+1}) \tag{7.60}$$

since $u(t_j) = y_j$. Thus, the local error of a method of order $p$ is $O(h^{p+1})$. 
Suppose now we have a second Runge-Kutta method of the form (7.49). 
Specifically, we assume that with the initial starting value $y_j$ we compute 

$$\hat y_{j+1} = y_j + h_j\hat\Phi(t_j, y_j, h_j) \tag{7.61}$$

by a method of order $q \ge p + 1$ (so presumably $\hat y_{j+1}$ is more accurate than 
$y_{j+1}$). For this order $q$ method, we have that 

$$\hat d(t_j, u(t_j), h_j) = u(t_{j+1}) - u(t_j) - h_j\hat\Phi(t_j, u(t_j), h_j) = O(h_j^{q+1}).$$

But since $u(t_j) = y_j$, combining this with (7.61) gives 

$$u(t_{j+1}) - y_{j+1} = O(h_j^{q+1}) + \hat y_{j+1} - y_{j+1}. \tag{7.62}$$

Therefore, by (7.62), the local error can be estimated to $O(h_j^{q+1})$ by 
$\hat y_{j+1} - y_{j+1}$. 

Hence, by computing $\hat y_{j+1}$ and $y_{j+1}$ at each time step we can estimate the 
local error and adjust the time step accordingly. One way of doing this is as 
follows. We assume that $\epsilon > 0$ is a given required tolerance and insist at each 
time step that 

$$\|\hat y_{j+1} - y_{j+1}\| \le \epsilon(t_{j+1} - t_j). \tag{7.63}$$

If (7.63) is not satisfied, the time step is reduced, $\hat y_{j+1}$ and $y_{j+1}$ are 
recomputed, and the local error again estimated, etc. It can be shown (exercise) 
that if (7.63) holds and $f(t, y)$ and $\Phi(t, y, h)$ are Lipschitz continuous with 
respect to $y$ then the global error satisfies 

$$\|y(t_j) - y_j\| \le c\epsilon \tag{7.64}$$

for some constant $c$ independent of $\epsilon$ and $h$. Thus, this time step strategy 
controls the global error in the sense of (7.64). 

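A popular error control method is the Runge-Kutta-Fehlberg Method. In 
the RKF method, we choose a pair ($p = 4$, $q = 5$) of Runge-Kutta methods 
that together only require six function evaluations per time step. These are 
based on the following function evaluations: 

$$
\begin{aligned}
K_1 &= f(t_j, y_j),\\
K_2 &= f\left(t_j + \tfrac{h_j}{4},\; y_j + \tfrac{h_j}{4} K_1\right),\\
K_3 &= f\left(t_j + \tfrac{3h_j}{8},\; y_j + h_j\left(\tfrac{3}{32} K_1 + \tfrac{9}{32} K_2\right)\right),\\
K_4 &= f\left(t_j + \tfrac{12h_j}{13},\; y_j + h_j\left(\tfrac{1932}{2197} K_1 - \tfrac{7200}{2197} K_2 + \tfrac{7296}{2197} K_3\right)\right),\\
K_5 &= f\left(t_j + h_j,\; y_j + h_j\left(\tfrac{439}{216} K_1 - 8K_2 + \tfrac{3680}{513} K_3 - \tfrac{845}{4104} K_4\right)\right),\\
K_6 &= f\left(t_j + \tfrac{h_j}{2},\; y_j + h_j\left(-\tfrac{8}{27} K_1 + 2K_2 - \tfrac{3544}{2565} K_3 + \tfrac{1859}{4104} K_4 - \tfrac{11}{40} K_5\right)\right).
\end{aligned}
$$

Then the 4-th-order method is: 

$$y_{j+1} = y_j + h_j\left(\tfrac{25}{216} K_1 + \tfrac{1408}{2565} K_3 + \tfrac{2197}{4104} K_4 - \tfrac{1}{5} K_5\right) \tag{7.65}$$

and the 5-th-order method is: 

$$\hat y_{j+1} = y_j + h_j\left(\tfrac{16}{135} K_1 + \tfrac{6656}{12825} K_3 + \tfrac{28561}{56430} K_4 - \tfrac{9}{50} K_5 + \tfrac{2}{55} K_6\right) \tag{7.66}$$

and the local error $\approx \hat y_{j+1} - y_{j+1}$. 
A possible procedure for implementing the RKF method has the form: 

1. $h$ given, $\epsilon$ given. 
2. $t_{j+1} = t_j + h$ 
3. calculate $y_{j+1}$, $\hat y_{j+1}$ 
4. if $\|y_{j+1} - \hat y_{j+1}\| \le \epsilon(t_{j+1} - t_j)$, then output $y_{j+1}$, $t_{j+1}$ and let $j = j + 1$ 
5. $q = \left(\dfrac{1}{2}\,\dfrac{\epsilon h}{\|y_{j+1} - \hat y_{j+1}\|}\right)^{1/4}$ 
6. if $q < 0.1$, let $h = 0.1h$; 
   if $0.1 \le q \le 4$, let $h = qh$; 
   if $q > 4$, let $h = 4h$; 
   if $h > h_{max}$, let $h = h_{max}$ 
7. go to (2) until $t_{j+1} > t_{final}$ 

Step (5) results from the following argument. With step size $h$, the estimated 
local error per unit step satisfies $\|y_{j+1} - \hat y_{j+1}\|/h = O(h^4) \approx k h^4$ for some 
constant $k$. We wish to determine $qh$ so that, using step size $qh$, the corresponding quantity is 
at most $\epsilon$: 

$$O((qh)^4) \approx k q^4 h^4 = q^4 (h^4 k) \approx q^4\,\frac{\|y_{j+1} - \hat y_{j+1}\|}{h} \le \epsilon.$$

Hence, 

$$q^4 \le \frac{\epsilon h}{\|y_{j+1} - \hat y_{j+1}\|}.$$

Incorporating a “safety factor” of $\frac{1}{2}$, $q \le \left(\dfrac{\epsilon h}{2\|y_{j+1} - \hat y_{j+1}\|}\right)^{1/4}$. Note that error 
control methods can be applied to other numerical methods for solving (7.1) 
such as extrapolation methods. 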
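
The procedure above can be sketched in code for the scalar case. This is an illustrative sketch, not a definitive implementation: it assumes $n = 1$ (so norms are absolute values), uses the standard Fehlberg coefficients, and adds two details that are not in the listed steps, namely clipping the last step so that it lands on `t_final` and a hypothetical lower bound `h_min` to avoid an endless sequence of rejections.

```python
def rkf45_step(f, t, y, h):
    """Return (y4, y5): 4th- and 5th-order Fehlberg approximations at t + h."""
    k1 = f(t, y)
    k2 = f(t + h / 4, y + h * (k1 / 4))
    k3 = f(t + 3 * h / 8, y + h * (3 * k1 / 32 + 9 * k2 / 32))
    k4 = f(t + 12 * h / 13,
           y + h * (1932 * k1 - 7200 * k2 + 7296 * k3) / 2197)
    k5 = f(t + h,
           y + h * (439 * k1 / 216 - 8 * k2 + 3680 * k3 / 513 - 845 * k4 / 4104))
    k6 = f(t + h / 2,
           y + h * (-8 * k1 / 27 + 2 * k2 - 3544 * k3 / 2565
                    + 1859 * k4 / 4104 - 11 * k5 / 40))
    y4 = y + h * (25 * k1 / 216 + 1408 * k3 / 2565 + 2197 * k4 / 4104 - k5 / 5)
    y5 = y + h * (16 * k1 / 135 + 6656 * k3 / 12825 + 28561 * k4 / 56430
                  - 9 * k5 / 50 + 2 * k6 / 55)
    return y4, y5

def rkf45(f, t0, y0, t_final, h, eps, h_max=0.5, h_min=1e-12):
    """Adaptive integration of a scalar IVP with the step-size rule of steps (1)-(7)."""
    t, y = t0, y0
    ts, ys = [t], [y]
    while t < t_final:
        last = t + h >= t_final
        if last:
            h = t_final - t                  # clip the final step (assumption)
        y4, y5 = rkf45_step(f, t, y, h)
        err = abs(y5 - y4)                   # local error estimate
        accepted = err <= eps * h            # acceptance test (7.63)
        if accepted:
            t = t_final if last else t + h
            y = y4
            ts.append(t)
            ys.append(y)
        # step-size factor q with safety factor 1/2, limited to [0.1, 4]
        q = 4.0 if err == 0.0 else (0.5 * eps * h / err) ** 0.25
        h = min(h * min(max(q, 0.1), 4.0), h_max)
        if not accepted and h < h_min:
            raise RuntimeError("step size fell below h_min")
    return ts, ys

if __name__ == "__main__":
    ts, ys = rkf45(lambda t, y: t + y, 1.0, 2.0, 2.0, h=0.1, eps=1e-6)
    print(len(ts) - 1, "accepted steps; y(2) is approximately", ys[-1])
```

The sketch advances with the 4th-order value $y_{j+1}$, as in the text; some implementations instead advance with the 5th-order value, which is a separate design choice.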


7.5 Multistep Methods 


As an introduction to multistep methods, consider a simple example. 
Suppose that $n = 1$ and we use Simpson’s method to approximate 

$$\int_{t_j}^{t_{j+2}} y'(t)\,dt = \int_{t_j}^{t_{j+2}} f(t, y(t))\,dt,$$

where $y(t)$ solves (7.1). Then 

$$y(t_{j+2}) - y(t_j) \approx \frac{h}{3}\left[f(t_{j+2}, y(t_{j+2})) + 4f(t_{j+1}, y(t_{j+1})) + f(t_j, y(t_j))\right]$$

from which the 2-step Simpson’s method follows: 

$$y_{j+2} = y_j + \frac{h}{3}\left[f(t_{j+2}, y_{j+2}) + 4f(t_{j+1}, y_{j+1}) + f(t_j, y_j)\right]. \tag{7.67}$$

In addition to requiring two starting values, $y_0$ and $y_1$, for Simpson’s method, 
$y_{j+2}$ is given implicitly, since $y_{j+2}$ occurs on the right-hand side of (7.67). 
Thus, generally, a nonlinear system has to be solved at each step. 


DEFINITION 7.6 A k-step multistep method for solution of (7.1) is a 
method of the form 

$$
\begin{aligned}
&y_0, y_1, \ldots, y_{k-1} \text{ given starting values},\\
&\alpha_k y_{j+k} + \alpha_{k-1} y_{j+k-1} + \cdots + \alpha_0 y_j
= h\left(\beta_k f_{j+k} + \beta_{k-1} f_{j+k-1} + \cdots + \beta_0 f_j\right)
\end{aligned}
\tag{7.68}
$$

for $j = 0, 1, 2, \ldots, N-k$ or 

$$\sum_{l=0}^{k} \alpha_l y_{j+l} = h\sum_{l=0}^{k} \beta_l f_{j+l}$$

for $j = 0, 1, 2, \ldots, N-k$, where $\{\alpha_l\}_{l=0}^{k}$, $\{\beta_l\}_{l=0}^{k}$ are given constants, 
independent of $t$ or $h$, and $f_{j+l} = f(t_{j+l}, y_{j+l})$. 

We assume for definiteness that $\alpha_k = 1$ and $|\alpha_0| + |\beta_0| > 0$ so that we 
do indeed have a k-step method. Note that to start such a method we need 
values $\{y_j\}_{j=0}^{k-1}$, which are generally obtained using $y_0 = y(t_0)$ and the use of a 
single-step Runge-Kutta method for $k - 1$ steps. (A Runge-Kutta method of 
the same (or higher) order accuracy as the multistep method would be used.) 

DEFINITION 7.7 If $\beta_k = 0$ the method is called explicit; if $\beta_k \ne 0$ 
the method is called implicit, and requires solution of a nonlinear system to 
compute $y_{j+k}$ for each step. 


Example 7.6 

Euler’s method is a single-step explicit method ($k = 1$) with $\alpha_1 = 1$, $\alpha_0 = -1$, 
$\beta_1 = 0$, $\beta_0 = 1$. Simpson’s method is a two-step implicit method ($k = 2$) with 
$\alpha_2 = 1$, $\alpha_1 = 0$, $\alpha_0 = -1$, $\beta_2 = 1/3$, $\beta_1 = 4/3$, $\beta_0 = 1/3$. 

A frequently used one-step implicit method called the trapezoidal rule is 
given by 

$$y_{j+1} - y_j = h\left(\frac{1}{2} f_{j+1} + \frac{1}{2} f_j\right) \tag{7.69}$$

which is, of course, obtained by approximating the integral in 

$$y(t_{j+1}) - y(t_j) = \int_{t_j}^{t_{j+1}} f(t, y(t))\,dt$$

by the trapezoidal rule. 

There are many ways to derive multistep methods of the form (7.68), e.g., 
by Taylor series, approximate integration, polynomial approximation, etc. A 
class of k-step methods of the form 

$$y_{j+k} - y_{j+k-1} = h\sum_{l=0}^{k} \beta_l f_{j+l}$$

is the class of Adams methods. In particular, Adams-Bashforth methods have 
$\beta_k = 0$ and Adams-Moulton methods have $\beta_k \ne 0$. A classical work in which 
the derivation of these methods is clearly explained is [84]. 

We now consider convergence, stability, and consistency of multistep meth- 
ods. 


DEFINITION 7.8 = The k-step method (7.68) is convergent to solution 


y(t) of (7.1) af 
ra y(t) (7.70) 
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for all t € [a,b], provided that lim Yj; = Yo for0<j<k-1. 


Associated with the multistep method (7.68), we define a linear operator £L 
(for n = 1) by 


k 
= SVlary(t + Ih) — hry’ (t + 1h). (7.71) 
l=0 


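Expanding the right-hand side in a Taylor series about $t$, we have

$$L[y(t); h] = c_0 y(t) + c_1 h y'(t) + c_2 h^2 y''(t) + \cdots + c_q h^q y^{(q)}(t) + \cdots, \tag{7.72}$$

where $c_0, c_1, c_2, \ldots, c_q$ are constants depending on the coefficients $\alpha_j$, $\beta_j$,
$0 \le j \le k$. These constants are given explicitly by

$$\begin{aligned}
c_0 &= \sum_{l=0}^{k} \alpha_l, \\
c_1 &= \sum_{l=1}^{k} l\,\alpha_l - \sum_{l=0}^{k} \beta_l, \\
c_q &= \frac{1}{q!}\bigl(\alpha_1 + 2^q \alpha_2 + \cdots + k^q \alpha_k\bigr) - \frac{1}{(q-1)!}\bigl(\beta_1 + 2^{q-1}\beta_2 + \cdots + k^{q-1}\beta_k\bigr),
\end{aligned} \tag{7.73}$$

for $q = 2, 3, \ldots$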


With this, we give an alternate definition of order of accuracy for multistep
methods.


DEFINITION 7.9 The k-step method (7.68) is said to be of order of
accuracy $p$ if in (7.73) $c_0 = c_1 = \cdots = c_p = 0$ but $c_{p+1} \neq 0$.


To see why this definition is reasonable, consider the local error for multistep
methods. For simplicity, consider the general two-step method and $n = 1$:

$$\alpha_2 y_{j+2} = -\alpha_1 y_{j+1} - \alpha_0 y_j + h\bigl(\beta_2 f(t_{j+2}, y_{j+2}) + \beta_1 f(t_{j+1}, y_{j+1}) + \beta_0 f(t_j, y_j)\bigr).$$

Let $y_{j+1} = y(t_{j+1})$ and $y_j = y(t_j)$, that is, $y_j$ and $y_{j+1}$ are exact. Let

$$\alpha_2 \hat{y}_{j+2} = -\alpha_1 y(t_{j+1}) - \alpha_0 y(t_j) + h\bigl(\beta_2 f(t_{j+2}, \hat{y}_{j+2}) + \beta_1 f(t_{j+1}, y(t_{j+1})) + \beta_0 f(t_j, y(t_j))\bigr),$$

so the local error is

$$d_j = |y(t_{j+2}) - \hat{y}_{j+2}|.$$
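By reasoning similar to that used for one-step methods, if $d_j = O(h^{p+1})$, then
the global error should be $O(h^p)$. Therefore, consider

$$\begin{aligned}
d_j &= |y(t_{j+2}) - \hat{y}_{j+2}| \\
&= \frac{1}{|\alpha_2|}\Bigl|\alpha_2 y(t_{j+2}) + \alpha_1 y(t_{j+1}) + \alpha_0 y(t_j) - h\beta_2 f(t_{j+2}, \hat{y}_{j+2}) - h\beta_1 f(t_{j+1}, y(t_{j+1})) - h\beta_0 f(t_j, y(t_j))\Bigr| \\
&\le \frac{1}{|\alpha_2|}\Bigl|\alpha_2 y(t_{j+2}) + \alpha_1 y(t_{j+1}) + \alpha_0 y(t_j) - h\beta_2 f(t_{j+2}, y(t_{j+2})) - h\beta_1 f(t_{j+1}, y(t_{j+1})) - h\beta_0 f(t_j, y(t_j))\Bigr| \\
&\qquad + \left|\frac{h\beta_2}{\alpha_2}\right| \bigl|f(t_{j+2}, y(t_{j+2})) - f(t_{j+2}, \hat{y}_{j+2})\bigr|.
\end{aligned}$$

Hence,

$$d_j \le \frac{1}{|\alpha_2|}\Bigl|\alpha_2 y(t_{j+2}) + \alpha_1 y(t_{j+1}) + \alpha_0 y(t_j) - h\beta_2 y'(t_{j+2}) - h\beta_1 y'(t_{j+1}) - h\beta_0 y'(t_j)\Bigr| + hL\left|\frac{\beta_2}{\alpha_2}\right| d_j,$$

assuming that $f$ satisfies a Lipschitz condition with constant $L$. Thus,

$$\left(1 - hL\left|\frac{\beta_2}{\alpha_2}\right|\right) d_j \le \frac{1}{|\alpha_2|}\Bigl|\alpha_2 y(t_{j+2}) + \alpha_1 y(t_{j+1}) + \alpha_0 y(t_j) - h\beta_2 y'(t_{j+2}) - h\beta_1 y'(t_{j+1}) - h\beta_0 y'(t_j)\Bigr|.$$

(We assume that $1 - h_0 L\,|\beta_2/\alpha_2| > 0$ where $0 < h < h_0$.) Expanding in Taylor
series about $t_j$,

$$\begin{aligned}
\left(1 - hL\left|\frac{\beta_2}{\alpha_2}\right|\right) d_j \le \frac{1}{|\alpha_2|}\Bigl|\,
&\alpha_2\Bigl[y(t_j) + y'(t_j)\,2h + y''(t_j)\frac{4h^2}{2} + y'''(t_j)\frac{8h^3}{6} + \cdots\Bigr] \\
&+ \alpha_1\Bigl[y(t_j) + y'(t_j)\,h + y''(t_j)\frac{h^2}{2} + \cdots\Bigr] + \alpha_0 y(t_j) \\
&- h\beta_2\Bigl[y'(t_j) + y''(t_j)\,2h + y'''(t_j)\frac{4h^2}{2} + \cdots\Bigr] \\
&- h\beta_1\Bigl[y'(t_j) + y''(t_j)\,h + y'''(t_j)\frac{h^2}{2} + \cdots\Bigr] - h\beta_0 y'(t_j)\Bigr| \\
= \frac{1}{|\alpha_2|}\Bigl|\,
&(\alpha_2 + \alpha_1 + \alpha_0)\,y(t_j) + (2\alpha_2 + \alpha_1 - \beta_2 - \beta_1 - \beta_0)\,h y'(t_j) \\
&+ \bigl(2\alpha_2 + \tfrac{1}{2}\alpha_1 - 2\beta_2 - \beta_1\bigr)h^2 y''(t_j) \\
&+ \bigl(\tfrac{8}{6}\alpha_2 + \tfrac{1}{6}\alpha_1 - 2\beta_2 - \tfrac{1}{2}\beta_1\bigr)h^3 y'''(t_j) + \cdots\Bigr|.
\end{aligned}$$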
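But the sum inside the absolute value on the right-hand side of the above inequality is exactly

$$c_0 y(t_j) + c_1 h y'(t_j) + c_2 h^2 y''(t_j) + c_3 h^3 y'''(t_j) + \cdots$$

where $c_i$, $i = 0, 1, 2, \ldots$ satisfy (7.73). Specifically,

$$\begin{aligned}
c_0 &= \alpha_2 + \alpha_1 + \alpha_0, \\
c_1 &= 2\alpha_2 + \alpha_1 - \beta_2 - \beta_1 - \beta_0, \\
c_2 &= \tfrac{1}{2}(4\alpha_2 + \alpha_1) - (2\beta_2 + \beta_1), \\
c_3 &= \tfrac{1}{6}(8\alpha_2 + \alpha_1) - \tfrac{1}{2}(4\beta_2 + \beta_1).
\end{aligned}$$

In fact, by examining our argument, the local error satisfies

$$d_j \le \frac{1}{c\,|\alpha_2|}\bigl|L[y(t_j); h]\bigr|, \qquad c = 1 - \left|\frac{\beta_2}{\alpha_2}\right| hL.$$

In general,

$$d_j \le \frac{1}{c\,|\alpha_k|}\bigl|L[y(t_j); h]\bigr| = O(h^{p+1})$$

if $c_0 = c_1 = c_2 = \cdots = c_p = 0$ but $c_{p+1} \neq 0$, where $c = 1 - \left|\frac{\beta_k}{\alpha_k}\right| hL > 0$ for
$0 < h < h_0$ and sufficiently small $h_0$.
Thus, for sufficient smoothness of the solution $y(t)$, the local error is $O(h^{p+1})$,
which implies that the global error is $O(h^p)$.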


Example 7.7 

It is straightforward to see that Euler's method is of order $p = 1$. (This agrees
with the analysis for one-step methods.) Specifically, using (7.68) and (7.73),
$\alpha_1 = 1$, $\alpha_0 = -1$, $\beta_1 = 0$ and $\beta_0 = 1$ for Euler's method. Thus, $c_0 =
\alpha_1 + \alpha_0 = 0$, $c_1 = \alpha_1 - \beta_1 - \beta_0 = 0$ and $c_2 = \tfrac{1}{2}\alpha_1 - \beta_1 = \tfrac{1}{2} \neq 0$. Hence, by
Definition 7.9, Euler's method has order of accuracy $p = 1$.


Consider the one-step implicit Trapezoidal method. For this method, $\alpha_1 =
1$, $\alpha_0 = -1$, $\beta_1 = 1/2$, and $\beta_0 = 1/2$. Hence, $c_0 = \alpha_1 + \alpha_0 = 0$, $c_1 =
\alpha_1 - \beta_1 - \beta_0 = 0$, $c_2 = \tfrac{1}{2}\alpha_1 - \beta_1 = 0$, $c_3 = \tfrac{1}{6}\alpha_1 - \tfrac{1}{2}\beta_1 = \tfrac{1}{6} - \tfrac{1}{4} = -\tfrac{1}{12} \neq 0$.
Thus, the Trapezoidal method has order of accuracy $p = 2$. In addition, it is
straightforward to show that the two-step implicit Simpson's method (defined
by Equation (7.67) on page 405) has order $p = 4$ (Exercise 21 on page 435).
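The order computations of Example 7.7 can be checked mechanically from (7.73). The following is a minimal Python sketch, not part of the original text: the function names `error_constants` and `order` are hypothetical, and exact rational arithmetic with `fractions.Fraction` is an assumption made so that the tests $c_q = 0$ are exact rather than subject to rounding.

```python
from fractions import Fraction
from math import factorial


def error_constants(alpha, beta, qmax=6):
    """Return [c_0, ..., c_qmax] of (7.73); alpha[l], beta[l] hold the
    coefficients for l = 0, ..., k."""
    k = len(alpha) - 1
    a = [Fraction(x) for x in alpha]
    b = [Fraction(x) for x in beta]
    c = [sum(a), sum(l * a[l] for l in range(k + 1)) - sum(b)]
    for q in range(2, qmax + 1):
        cq = sum(Fraction(l ** q, factorial(q)) * a[l] for l in range(1, k + 1))
        cq -= sum(Fraction(l ** (q - 1), factorial(q - 1)) * b[l]
                  for l in range(1, k + 1))
        c.append(cq)
    return c


def order(alpha, beta, qmax=6):
    """Order p of Definition 7.9: c_0 = ... = c_p = 0 and c_{p+1} != 0.
    Returns None if every constant up to c_qmax vanishes."""
    for q, cq in enumerate(error_constants(alpha, beta, qmax)):
        if cq != 0:
            return q - 1
    return None


third = Fraction(1, 3)
print(order([-1, 1], [1, 0]))                                   # Euler
print(order([-1, 1], [Fraction(1, 2), Fraction(1, 2)]))         # trapezoidal
print(order([-1, 0, 1], [third, 4 * third, third]))             # Simpson
```

Under these assumptions the three calls report orders 1, 2, and 4, in agreement with Example 7.7. A return value below 1 signals that the method is not consistent.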

We now present an alternative formulation of consistency for k-step meth- 
ods. 


DEFINITION 7.10 The k-step method is consistent if it has order $p \ge 1$.

It follows that method (7.68) is consistent if and only if $c_0 = c_1 = 0$ in
(7.72) and (7.73), that is, if and only if

$$\sum_{l=0}^{k} \alpha_l = 0 \quad\text{and}\quad \sum_{l=1}^{k} l\,\alpha_l = \sum_{l=0}^{k} \beta_l. \tag{7.74}$$


REMARK 7.10 It is straightforward to show that this agrees with
our earlier definition of consistency for explicit one-step methods. (Consider
$\Phi(t_j, y_j, h) = \beta_0 f(t_j, y_j)$ with $\beta_1 = 0$.)


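In addition, we associate the following polynomials $\rho(z)$ and $\sigma(z)$ of a single
complex variable $z$ with multistep methods.

$$\begin{aligned}
\rho(z) &= \alpha_0 + \alpha_1 z + \alpha_2 z^2 + \cdots + \alpha_k z^k = \sum_{l=0}^{k} \alpha_l z^l, \\
\sigma(z) &= \beta_0 + \beta_1 z + \beta_2 z^2 + \cdots + \beta_k z^k = \sum_{l=0}^{k} \beta_l z^l.
\end{aligned} \tag{7.75}$$

It follows that the k-step method (7.68) is consistent if and only if

$$\rho(1) = 0, \qquad \rho'(1) = \sigma(1). \tag{7.76}$$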


We are now going to define stability, but first we motivate the definition. 
Consider the scalar initial-value problem y'(t) = 0, t > 0, y(0) = 0 whose 
unique solution is y(t) = 0. Consider obtaining an approximate solution by a 
simple two-step method 


$$y_{j+2} + \alpha_1 y_{j+1} + \alpha_0 y_j = 0, \quad j = 0, 1, 2, \ldots \tag{7.77}$$

since $\alpha_2 = 1$ and $f(t, y) = 0$. Equation (7.77) is a difference equation whose
solution can be found by substituting $y_j = z^j$, $j = 0, 1, 2, \ldots$ for some complex
number $z$. The equation so obtained is

$$z^2 + \alpha_1 z + \alpha_0 = 0. \tag{7.78}$$

Since $\rho(z) = z^2 + \alpha_1 z + \alpha_0$, we see that $z$ is a zero of $\rho(z)$. Now suppose
that (7.78) has distinct roots $z_1$ and $z_2$. (If the roots are the same, then
$y_j = A_1 h z_1^j + j A_2 h z_1^j$.) Then

$$y_j = A_1 h z_1^j + A_2 h z_2^j \tag{7.79}$$

is a solution of (7.77), where $A_1$ and $A_2$ are constants.
To find $A_1$ and $A_2$, we note that $y_0 = y_1 = 0$, which gives

$$\begin{aligned}
A_1 h + A_2 h &= 0, \\
A_1 h z_1 + A_2 h z_2 &= 0,
\end{aligned} \tag{7.80}$$

from which $A_1 = A_2 = 0$. Hence, (7.79) indeed gives $y_j = 0$, $j = 0, 1, 2, \ldots$.
However, if (7.80) is solved numerically for $A_1$ and $A_2$, or if $y_1 \neq 0$ but is very
small (because $y_1$ may be obtained from $y_0$ using some single-step method in
the presence of rounding errors), then nonzero $A_1$ and $A_2$ may be obtained.
Now, if $|z_1|, |z_2| \le 1$, we see that $y_j \to 0$ as $j \to \infty$, $h \to 0$, $jh = t > 0$, i.e.,
the method is convergent (to the solution $y(t) = 0$), as it should be. However, if
$|z_1| > 1$ or $|z_2| > 1$, then $|y_j| \to \infty$ as $j \to \infty$, $h \to 0$, $jh = t > 0$; the method
is therefore not convergent in that case. Indeed, small errors are violently
magnified as $h \to 0$. (Note, in the case $z_1 = z_2$, the solution (7.79) of (7.77)
has the form $y_j = A_1 h z_1^j + j A_2 h z_1^j$ and strict inequality $|z_1| < 1$ must hold
for convergence.)
We now define stability. 


DEFINITION 7.11 The k-step method (7.68) is said to be stable if the
following conditions hold:


(a) all roots $z_j$, $1 \le j \le k$, of $\rho(z) = 0$ satisfy $|z_j| \le 1$,


(b) if a root $z_j$ is not a single root, then it satisfies $|z_j| < 1$.


Example 7.8 

It can be shown, for example, that Euler's method, the trapezoidal method
and Simpson's method are all stable. Consider Simpson's method. For Simpson's
method, $\alpha_0 = -1$, $\alpha_1 = 0$, $\alpha_2 = 1$, and $\rho(z) = z^2 - 1$. Hence, $z_1 = 1$,
$z_2 = -1$, and $|z_j| \le 1$ for $j = 1, 2$.
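For two-step methods, the root condition of Definition 7.11 reduces to inspecting the two roots of a quadratic. The sketch below is an illustration only, restricted to $k = 2$; the function names and the tolerance `tol` used to decide whether two computed roots coincide are assumptions, not part of the text.

```python
import cmath


def roots_quadratic(a2, a1, a0):
    """Both roots of a2*z^2 + a1*z + a0 = 0 (a2 != 0), possibly complex."""
    disc = cmath.sqrt(a1 * a1 - 4 * a2 * a0)
    return (-a1 + disc) / (2 * a2), (-a1 - disc) / (2 * a2)


def is_stable_two_step(alpha, tol=1e-12):
    """Root condition for rho(z) = alpha[0] + alpha[1]*z + alpha[2]*z^2:
    (a) every root has modulus <= 1; (b) a double root has modulus < 1."""
    a0, a1, a2 = alpha
    z1, z2 = roots_quadratic(a2, a1, a0)
    if max(abs(z1), abs(z2)) > 1 + tol:
        return False
    if abs(z1 - z2) <= tol:          # repeated root: strict inequality needed
        return abs(z1) < 1 - tol
    return True


print(is_stable_two_step((-1, 0, 1)))   # Simpson: rho(z) = z^2 - 1
print(is_stable_two_step((-5, 4, 1)))   # rho(z) = z^2 + 4z - 5, roots 1 and -5
print(is_stable_two_step((1, -2, 1)))   # rho(z) = (z - 1)^2, double root at 1
```

Simpson's polynomial passes the test, while a polynomial with a root of modulus greater than one, or with a double root on the unit circle, fails it.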


Before continuing, a brief review may be helpful.

• A k-step method has the form

$$\sum_{l=0}^{k} \alpha_l y_{j+l} = h \sum_{l=0}^{k} \beta_l f(t_{j+l}, y_{j+l}), \quad j = 0, 1, 2, \ldots, N-k \tag{7.68}$$

where $y_0, y_1, \ldots, y_{k-1}$ are given. (Explicit if $\beta_k = 0$, implicit if $\beta_k \neq 0$.)

• The multistep method is consistent if the order $p \ge 1$ and thus if

$$\sum_{l=0}^{k} \alpha_l = 0 \quad\text{and}\quad \sum_{l=1}^{k} l\,\alpha_l = \sum_{l=0}^{k} \beta_l. \tag{7.74}$$

• If

$$\rho(z) = \sum_{l=0}^{k} \alpha_l z^l \quad\text{and}\quad \sigma(z) = \sum_{l=0}^{k} \beta_l z^l, \tag{7.75}$$

then the method is consistent if $\rho(1) = 0$, $\rho'(1) = \sigma(1)$.

• Furthermore, the method is stable if all roots of $\rho(z) = 0$ satisfy $|z_j| \le 1$,
with strict inequality if $z_j$ is not a single root.


We now have the following important theorem (due to Dahlquist) connect- 
ing consistency, stability, and convergence. 


THEOREM 7.5 
For a multistep method of the form (7.68) to be convergent, it is necessary 
and sufficient for it to be consistent and stable. 


PROOF We prove here only necessity for $n = 1$, although the theorem
holds for $n > 1$. (In place of sufficiency, a stronger result is given in Theorem
7.6.)

If a method is convergent, then it is convergent for the initial-value problem
$y'(t) = 0$, $t > 0$, $y(0) = 0$, whose solution is $y(t) = 0$, $t \ge 0$. The k-step method
for this problem is:

$$y_{j+k} + \alpha_{k-1} y_{j+k-1} + \cdots + \alpha_0 y_j = 0, \quad j \ge 0. \tag{7.81}$$

Since the method is convergent, for any $t > 0$ we have

$$\lim_{h \to 0} y_j = 0, \tag{7.82}$$

where $h = t/j$, whenever

$$\lim_{h \to 0} y_j = 0, \quad 0 \le j \le k-1. \tag{7.83}$$

Now let $\eta = re^{i\varphi}$, $r \ge 0$, $0 \le \varphi < 2\pi$, be a root of $\rho(\eta) = 0$ and consider the
numbers $y_0, y_1, y_2, \ldots$ given by

$$y_j = h r^j \cos(j\varphi) \tag{7.84}$$

(recall that $\eta^j = r^j e^{ij\varphi} = r^j \cos(j\varphi) + i r^j \sin(j\varphi)$). These numbers satisfy
difference equation (7.81) and also satisfy (7.83). By hypothesis, (7.82) must
hold for these $y_j$. If $\varphi = 0$ or $\varphi = \pi$, then (7.82) implies that $r \le 1$. In
case $\varphi \neq 0$, $\varphi \neq \pi$, note that

$$y_j^2 - y_{j+1} y_{j-1} = h^2 r^{2j} \cos^2(j\varphi) - h^2 r^{2j} \cos((j+1)\varphi)\cos((j-1)\varphi) = h^2 r^{2j} \sin^2(\varphi).$$

That is,

$$\frac{y_j^2 - y_{j+1} y_{j-1}}{\sin^2(\varphi)} = h^2 r^{2j}.$$

Now the left-hand side tends to $0$ as $j \to \infty$ by (7.82). This implies that $h^2 r^{2j} \to 0$
as $j \to \infty$, $h = t/j$, from which $r \le 1$. We thus obtain $r \le 1$, property (a) of
the stability condition.

Now let $\eta = re^{i\varphi}$ be a root of $\rho(\eta)$ of multiplicity greater than one. Now
$y_j = \sqrt{h}\, j\, r^j \cos(j\varphi)$ forms a solution of difference equation (7.81) when $\eta$ is
a multiple root of $\rho(\eta) = 0$, and these numbers also satisfy (7.83). If $\varphi = 0$
or $\varphi = \pi$, we obtain $|y_j| = \sqrt{h}\, j\, r^j = \sqrt{t j}\, r^j \to 0$ as $j \to \infty$ for fixed
$t > 0$ only if $r < 1$. (Recall that $t = hj$.) If $\varphi \neq 0$, $\varphi \neq \pi$, we use the relation
$z_j^2 - z_{j+1} z_{j-1} = r^{2j} \sin^2(\varphi)$, $z_j = y_j/(j\sqrt{h})$, to obtain the same conclusion
that $r < 1$. We have thus shown that convergence implies stability.

To see that convergence implies consistency, we have to show that convergence
implies that $\rho(1) = 0$, $\rho'(1) = \sigma(1)$, or that $c_0 = c_1 = 0$. To show that
$c_0 = 0$, consider the initial-value problem $y'(t) = 0$, $y(0) = 1$ with exact solution
$y(t) = 1$. Assuming starting values $y_j = 1$, $0 \le j \le k-1$, we conclude
that $y_j$, the solution of (7.81), must satisfy $\lim_{h \to 0} y_j = 1$, $jh = t > 0$ fixed. Letting
$j \to \infty$ in (7.81), we obtain that $\rho(1) = 0 = c_0 = \sum_{l=0}^{k} \alpha_l$. To obtain $c_1 = 0$,
we consider the initial-value problem $y'(t) = 1$, $y(0) = 0$. (The exact solution
is $y(t) = t$.) Thus, $y_j \to jh$ as $j \to \infty$ and $f(t_j, y_j) = 1$ for all $j$. Then,
considering (7.68),

$$\alpha_k (j+k)h + \alpha_{k-1}(j+k-1)h + \cdots + \alpha_0 (jh) = h(\beta_k + \beta_{k-1} + \cdots + \beta_0).$$

Rearranging this expression,

$$\alpha_k k h + \alpha_{k-1}(k-1)h + \cdots + jh(\alpha_k + \alpha_{k-1} + \cdots + \alpha_0) = h(\beta_k + \beta_{k-1} + \cdots + \beta_0).$$

Hence, $\rho'(1) = \sigma(1)$, because $(\alpha_k + \alpha_{k-1} + \cdots + \alpha_0) = 0$.


Consider what happens when a method fails to be consistent or fails to be 
stable. 
First consider the method:

$$y_{j+2} + 4y_{j+1} - 5y_j = h(4f_{j+1} + 2f_j). \tag{7.85}$$

This method is consistent with order $p = 3$ but it is not stable, since the
roots of $z^2 + 4z - 5 = 0$ are $z = 1$ and $z = -5$. By Theorem 7.5, it is not
convergent. Apply the method to $y' = 4t\sqrt{y}$, $0 \le t \le 2$, $y(0) = 1$, with solution
$y(t) = (1 + t^2)^2$. Using exact starting values in (7.85), $y_0 = 1$, $y_1 = (1 + h^2)^2$,
we obtain with $h = 0.1$,


t 


exact solution [1 | 1. 082 1. 316 —. 000 =. 884 


numerical solution | 1 


2.0 
25.000 


The errors are oscillating wildly. 
Now consider the following method:

$$y_{j+2} - y_{j+1} = \frac{h}{3}(3f_{j+1} - 2f_j). \tag{7.86}$$

This two-step method is stable but it is not consistent. We have $\rho(z) = 0 - z + z^2$,
$\sigma(z) = -\tfrac{2}{3} + z + 0z^2$, $\rho(1) = 0$, but $\rho'(z) = -1 + 2z$, $\rho'(1) = 1 \neq \sigma(1) = \tfrac{1}{3}$.
For the same initial-value problem as above, we obtain:

As the step size decreases, the error increases rather than decreases, although
the violent behavior, characteristic of unstable methods, is absent.
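Both failure modes can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions: instead of the text's problem $y' = 4t\sqrt{y}$ (where an unstable recursion can drive $y$ negative and make the square root undefined), it uses the simpler test problem $y' = -y$, $y(0) = 1$, with exact solution $e^{-t}$; the helper names `two_step` and `error_at_2` are hypothetical.

```python
import math


def two_step(alpha, beta, f, t0, y0, y1, h, n):
    """Explicit two-step method with alpha = (a0, a1, 1), beta = (b0, b1, 0):
    y_{j+2} = -a1*y_{j+1} - a0*y_j + h*(b1*f_{j+1} + b0*f_j)."""
    a0, a1, _ = alpha
    b0, b1, _ = beta
    ys = [y0, y1]
    for j in range(n - 1):
        fj = f(t0 + j * h, ys[j])
        fj1 = f(t0 + (j + 1) * h, ys[j + 1])
        ys.append(-a1 * ys[j + 1] - a0 * ys[j] + h * (b1 * fj1 + b0 * fj))
    return ys


def error_at_2(alpha, beta, h):
    """Absolute error at t = 2 for y' = -y, y(0) = 1, exact starting values."""
    n = round(2.0 / h)
    ys = two_step(alpha, beta, lambda t, y: -y, 0.0, 1.0, math.exp(-h), h, n)
    return abs(ys[n] - math.exp(-2.0))


UNSTABLE = ((-5.0, 4.0, 1.0), (2.0, 4.0, 0.0))              # method (7.85)
INCONSISTENT = ((0.0, -1.0, 1.0), (-2.0 / 3.0, 1.0, 0.0))   # method (7.86)

for h in (0.1, 0.05, 0.01):
    print(h, error_at_2(*UNSTABLE, h), error_at_2(*INCONSISTENT, h))
```

For this test problem, the error of (7.85) grows explosively as $h$ is reduced, driven by the root $z = -5$, whereas the error of (7.86) stays moderate but does not tend to zero.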

It is of course desirable to use a stable method with high order $p$. (With
high-order accuracy, the number of time steps can be reduced, which decreases
the accumulated rounding error.) We have the following error estimate for k-step
methods.


THEOREM 7.6 

Suppose that the k-step method (7.68) is stable and of order $p \ge 1$ (and thus
consistent). Let the conditions of Theorem 7.2 be satisfied and let $y(t)$, the
solution of IVP (7.1) (for $n = 1$), be $p + 1$ times continuously differentiable on
$[a,b]$. Then there is an $h_0 > 0$ such that, for $0 < h \le h_0$, the estimate

$$\max_{0 \le j \le N} |y(t_j) - y_j| \le c\left\{ \max_{0 \le j \le k-1} |y(t_j) - y_j| + h^p \max_{a \le t \le b} |y^{(p+1)}(t)| \right\} \tag{7.87}$$

holds for some positive constant $c$ independent of $y(t)$, $y_j$, and $h$.


PROOF See [90]. 


We now turn to absolute stability. A numerical method may be stable
and consistent (hence convergent) but, when applied to certain problems, may
require a step size $h$ too small for practical consideration. (See the example
preceding Definition 7.5.) Small values of $h$ will also lead to larger rounding
errors. Our definition of absolute stability applied to multistep methods is:


DEFINITION 7.12 The multistep method (7.68) is said to be absolutely
stable if, applied to the scalar equation $y' = \lambda y$, $t > 0$, it yields values $\{y_j\}_{j \ge 0}$
which satisfy $y_j \to 0$ as $j \to \infty$. The set of values $\lambda h$ for which $y_j \to 0$ as
$j \to \infty$ is called the set of absolute stability of (7.68).


We have the following result:


PROPOSITION 7.1
The k-step multistep method (7.68) is absolutely stable provided that the roots
$z_j$, $1 \le j \le k$, of the polynomial $\rho(z, \lambda h) = \rho(z) - \lambda h \sigma(z)$ satisfy $|z_j| < 1$.
PROOF Applying the k-step method (7.68) to $y' = \lambda y$, we obtain

$$\sum_{l=0}^{k} (\alpha_l - h\lambda\beta_l)\, y_{j+l} = 0, \quad j \ge 0.$$

The solution of this difference equation has the form:

$$y_j = \sum_{m=1}^{k} c_m z_m^j,$$

where, for each $m$, $1 \le m \le k$,

$$\sum_{l=0}^{k} (\alpha_l - \lambda h \beta_l)\, z_m^l = 0.$$

Thus, we require the roots of $\rho(z, \lambda h) = \rho(z) - \lambda h \sigma(z) = \sum_{l=0}^{k} (\alpha_l - \lambda h \beta_l) z^l$ to
satisfy $|z_j| < 1$, $1 \le j \le k$.


Example 7.9 
(Extremes) 


(a) The Midpoint Multistep Method has the form

$$y_{j+2} - y_j = 2h f_{j+1}$$

and thus $\beta_0 = 0$, $\beta_1 = 2$, $\beta_2 = 0$, $\alpha_0 = -1$, $\alpha_1 = 0$, $\alpha_2 = 1$. Since $k = 2$,
$\rho(z) = -1 + z^2$ and $\sigma(z) = 2z$. (Notice first that the method is stable
and consistent: $\rho(1) = 0$, $\rho'(1) = \sigma(1)$, and the roots of $\rho(z) = 0$
are simple with magnitude $\le 1$.) We have $\rho(z, \lambda h) = -1 + z^2 - 2\lambda h z$,
which has roots $z = \lambda h \pm \sqrt{(\lambda h)^2 + 1}$, so $z_1 = \lambda h + \sqrt{(\lambda h)^2 + 1} > 1$
for any $\lambda h > 0$. Consider $\lambda h < 0$. Then $z_2 = \lambda h - \sqrt{(\lambda h)^2 + 1} < -1$
for any $\lambda h < 0$. Thus, the midpoint method is not absolutely stable for
any real $\lambda h$.

(b) The Trapezoidal Method (7.69) has the form

$$y_{j+1} - y_j = h\left(\tfrac{1}{2} f_{j+1} + \tfrac{1}{2} f_j\right)$$

and hence $\rho(z) = -1 + z$, $\sigma(z) = \tfrac{1}{2}(1 + z)$, and $\rho(z, \lambda h) = -1 + z - \lambda h\bigl(\tfrac{1}{2}(1 + z)\bigr)$
has root

$$z_1 = \frac{1 + \frac{\lambda h}{2}}{1 - \frac{\lambda h}{2}}.$$

If $\mathrm{Re}(\lambda h) < 0$, we see that $|z_1| < 1$. The set of absolute stability for the
trapezoidal method, if $\lambda$ is complex, is the open left-half complex plane, and if
$\lambda$ is real, the interval of absolute stability is $(-\infty, 0)$. (Note that the
trapezoidal method is consistent and stable, as $\rho(1) = 0$, $\rho'(1) = \sigma(1)$, and the
roots of $\rho(z)$ are simple with magnitude $\le 1$.)
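The criterion of Proposition 7.1 can be explored numerically by computing the roots of $\rho(z) - \lambda h\,\sigma(z)$ for sample values of $\lambda h$. The following sketch is illustrative only and handles one- and two-step methods; the function names are hypothetical.

```python
import cmath


def stability_roots(alpha, beta, lam_h):
    """Roots of rho(z) - lam_h*sigma(z) for a one- or two-step method;
    alpha[l], beta[l] hold the coefficients for l = 0, ..., k."""
    coeffs = [a - lam_h * b for a, b in zip(alpha, beta)]
    if len(coeffs) == 2:                      # k = 1: linear polynomial
        c0, c1 = coeffs
        return [-c0 / c1]
    c0, c1, c2 = coeffs                       # k = 2: quadratic polynomial
    disc = cmath.sqrt(c1 * c1 - 4 * c2 * c0)
    return [(-c1 + disc) / (2 * c2), (-c1 - disc) / (2 * c2)]


def absolutely_stable(alpha, beta, lam_h):
    """True if every root has modulus strictly less than one."""
    return all(abs(z) < 1 for z in stability_roots(alpha, beta, lam_h))


MIDPOINT = ((-1, 0, 1), (0, 2, 0))
TRAPEZOID = ((-1, 1), (0.5, 0.5))

for lam_h in (-2.0, -0.5, -0.01, 0.01, 1.0):
    print(lam_h, absolutely_stable(*MIDPOINT, lam_h),
          absolutely_stable(*TRAPEZOID, lam_h))
```

For every real sample value the midpoint method fails the test, while the trapezoidal method passes it exactly when $\lambda h < 0$, consistent with Example 7.9.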
REMARK 7.11 Multistep methods have the advantage over Runge-Kutta
methods of providing high accuracy with few function evaluations. For
example, the four-step explicit Adams-Bashforth method has the form

$$y_{j+4} = y_{j+3} + \frac{h}{24}\bigl[55 f(t_{j+3}, y_{j+3}) - 59 f(t_{j+2}, y_{j+2}) + 37 f(t_{j+1}, y_{j+1}) - 9 f(t_j, y_j)\bigr] \tag{7.88}$$

where $y_0 = y(t_0)$ and $y_1, y_2, y_3$ are obtained using a fourth-order Runge-Kutta
method. This method requires only one new function evaluation per time
step, namely, $f(t_{j+3}, y_{j+3})$. (Recall that the fourth-order Runge-Kutta method
requires 4 new function evaluations per time step.)
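A minimal sketch of (7.88) follows, written for a scalar problem. It assumes the classical fourth-order Runge-Kutta formulas for the three starting values; the function names and the test problem in the usage lines are illustrative choices, not taken from the text.

```python
import math


def rk4_step(f, t, y, h):
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def adams_bashforth4(f, t0, y0, h, n):
    """Four-step Adams-Bashforth method (7.88) on n >= 3 steps of size h."""
    ts = [t0 + j * h for j in range(n + 1)]
    ys = [y0]
    for j in range(3):                       # starting values y_1, y_2, y_3
        ys.append(rk4_step(f, ts[j], ys[j], h))
    fs = [f(ts[j], ys[j]) for j in range(4)]
    for j in range(n - 3):
        y_new = ys[j + 3] + h / 24 * (55 * fs[j + 3] - 59 * fs[j + 2]
                                      + 37 * fs[j + 1] - 9 * fs[j])
        ys.append(y_new)
        fs.append(f(ts[j + 4], y_new))       # the single new evaluation per step
    return ts, ys


f = lambda t, y: t + y
exact = lambda t: -t - 1 + 4 * math.exp(t - 1)
for h, n in ((0.1, 10), (0.05, 20)):
    ts, ys = adams_bashforth4(f, 1.0, 2.0, h, n)
    print(h, abs(ys[-1] - exact(2.0)))
```

Stored values of $f$ are reused from step to step, so each Adams-Bashforth step costs one evaluation; halving $h$ should reduce the error at $t = 2$ by roughly the factor expected of a fourth-order method.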


7.6 Predictor-Corrector Methods 


We now consider predictor-corrector methods. Consider an implicit method
of the form:

$$y_{j+k} + \sum_{l=0}^{k-1} \alpha_l y_{j+l} = h\beta_k f(t_{j+k}, y_{j+k}) + h \sum_{l=0}^{k-1} \beta_l f_{j+l} \tag{7.89}$$

(where $\beta_k \neq 0$). A simple iterative scheme for solution of (7.89) is to compute
the sequence $\{y_{j+k}^{(s)}\}_{s \ge 0}$ defined by an initial guess $y_{j+k}^{(0)}$ and

$$y_{j+k}^{(s+1)} + \sum_{l=0}^{k-1} \alpha_l y_{j+l} = h\beta_k f(t_{j+k}, y_{j+k}^{(s)}) + h \sum_{l=0}^{k-1} \beta_l f_{j+l} \tag{7.90}$$

for $s = 0, 1, 2, \ldots$. Provided that $Lh|\beta_k| < 1$, this fixed-point iteration
sequence $\{y_{j+k}^{(s)}\}_{s \ge 0}$ converges to $y_{j+k}$, the exact solution of nonlinear system
(7.89). If $y_{j+k}^{(0)}$ is near $y_{j+k}$, then the iterations may converge rapidly.

A way to predict $y_{j+k}^{(0)}$ accurately is to use an explicit multistep method as
a predictor, then use the implicit method (7.89) as a corrector. A
predictor-corrector method has the form:

$$\text{Predictor P:} \quad y_{j+k}^{(0)} + \sum_{l=0}^{k-1} \alpha_l^{*} y_{j+l} = h \sum_{l=0}^{k-1} \beta_l^{*} f_{j+l} \tag{7.91}$$

$$\text{Corrector C:} \quad y_{j+k}^{(s+1)} + \sum_{l=0}^{k-1} \alpha_l y_{j+l} = h\beta_k f(t_{j+k}, y_{j+k}^{(s)}) + h \sum_{l=0}^{k-1} \beta_l f_{j+l} \tag{7.92}$$

for $s = 0, 1, 2, \ldots$.
The question arises as to when we stop the iterations. One could correct
until $|y_{j+k}^{(s+1)} - y_{j+k}^{(s)}| < \epsilon$. However, usually (7.91) is applied once and (7.92)
is applied $m$ times (generally $m = 1$ or $m = 2$).
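Some examples of predictor-corrector pairs are:

Euler-Trapezoidal Method:

$$\begin{aligned}
&\text{(P)} \quad y_{j+1} - y_j = h f_j && (p = 1) \\
&\text{(C)} \quad y_{j+1} - y_j = \frac{h}{2}(f_{j+1} + f_j) && (p = 2)
\end{aligned} \tag{7.93}$$

Adams Method:

$$\begin{aligned}
&\text{(P)} \quad y_{j+4} - y_{j+3} = \frac{h}{24}(55 f_{j+3} - 59 f_{j+2} + 37 f_{j+1} - 9 f_j) && (p = 4) \\
&\text{(C)} \quad y_{j+4} - y_{j+3} = \frac{h}{24}(9 f_{j+4} + 19 f_{j+3} - 5 f_{j+2} + f_{j+1}) && (p = 4)
\end{aligned} \tag{7.94}$$

The Adams method uses a 4-step Adams-Bashforth method as the predictor and
a 3-step Adams-Moulton method as the corrector.

Example of Euler-Trapezoidal Method

Consider

$$y'(t) = t + y, \qquad y(1) = 2,$$

with exact solution $y(t) = -t - 1 + 4e^{-1}e^{t}$. Let $h = 0.1$ with $y_0 = 2$ and
$t_0 = 1$. Let $m = 1$ (one correction). Then, for this problem,

$$\begin{aligned}
y_{j+1}^{(0)} &= y_j + h(t_j + y_j), \\
y_{j+1}^{(1)} &= y_j + \frac{h}{2}\bigl(t_{j+1} + y_{j+1}^{(0)}\bigr) + \frac{h}{2}(t_j + y_j).
\end{aligned}$$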

The following numerical results are obtained: 


$j$ | $t_j$ | $y_j$ | $y(t_j)$
0 | 1 | 2 | 2
1 | 1.1 | 2.32 | 2.32068

Some properties of predictor-corrector pairs are now summarized.


(i) Order of accuracy

Let $p^*$ be the order of accuracy of the predictor and $p$ the order of
accuracy of the corrector. If we correct $m \ge 1$ times and if $p^* \ge p - 1$,
then the order of accuracy of the PC pair (7.91)-(7.92) is $p$, the order of
the corrector. For example, the Euler-Trapezoidal pair with $p^* = 1$ and
$p = 2$ will have order $p = 2$. For the Adams method, we have $p^* = p = 4$
and hence the predictor-corrector pair has order $p = 4$.

(ii) Stability

The stability properties of the PC pair (7.91)-(7.92) are the same as
those of the corrector. (Thus, we obtain the generally superior stability
properties of the implicit corrector method.)

(iii) Absolute stability

Absolute stability of the PC pair generally depends on $m$ and on the PC pair.
For the Adams pair, the interval of absolute stability is $(-3, 0)$ if we
"correct to convergence," i.e., as $m \to \infty$. If $m = 1$, the interval of
absolute stability for the Adams pair is $(-1.25, 0)$. (The interval of absolute
stability for the Euler-Trapezoidal pair is $(-2, 0)$.)


REMARK 7.12 As in the Runge-Kutta method, it is possible to devise 
strategies to control local error by varying step size. Good programs are 
available based on predictor-corrector methods with variable step size and 
estimation of local error. 


REMARK 7.13 Tabulated below are intervals of absolute stability of
several methods.


Method | Interval
Fourth-Order Runge-Kutta | $(-2.78, 0)$
Adams-Bashforth, Adams-Moulton PC ($m = 1$) | $(-1.25, 0)$
Adams-Bashforth, Adams-Moulton PC ($m = \infty$) | $(-3.00, 0)$
Gragg's Method (extrapolation) | $(-3.10, 0)$
Trapezoidal Rule (solved exactly at each step) | $(-\infty, 0)$

7.7 Stiff Systems 
Consider the initial-value problem $y : [0, \infty) \to \mathbb{R}^n$ such that

$$y'(t) = Ay(t), \quad t \ge 0, \quad y(0) = y_0. \tag{7.95}$$

For now,$^{11}$ we will assume that $A$ is an $n \times n$ matrix with simple eigenvalues$^{12}$
$\lambda_i$, $1 \le i \le n$, with corresponding eigenvectors $x_i$, $1 \le i \le n$. It is well known
that the explicit solution of (7.95) is

$$y(t) = \sum_{i=1}^{n} c_i e^{\lambda_i t} x_i \tag{7.96}$$

where $c_i$, $1 \le i \le n$, are constants found by letting $t = 0$ in (7.96) and solving

$$y_0 = \sum_{i=1}^{n} c_i x_i$$

for the $c_i$'s. (You will derive this in Exercise 31 on page 437.)

$^{11}$ We will consider the simplest case here.
$^{12}$ That is, with $n$ distinct eigenvalues and, hence, with $n$ linearly independent eigenvectors.

The term stiff system originated in the study of mechanical systems with 
springs. A spring is “stiff” if its spring constant is large; in such a mechanical 
system, the spring will cause motions of the system that are fast relative to 
the time scale on which we are studying the system. In the numerical solution 
of initial value problems, “stiffness” has come to mean that the solution to 
the ODE has some components that vary or die out rapidly in relation to the 
other components, or in relation to the time interval over which the integration 
proceeds. For example, the scalar equation $y' = -1000y$ might be considered
to be moderately stiff when it is integrated for $0 \le t \le 1$.


Example 7.10 
Let's consider a stiff system

$$y' = Ay, \quad t > 0, \quad y(0) = (1, 0, -1)^T \tag{7.97}$$

where

$$A = \begin{pmatrix} -21 & 19 & -20 \\ 19 & -21 & 20 \\ 40 & -40 & -40 \end{pmatrix}.$$

The eigenvalues of $A$ are $\lambda_1 = -2$, $\lambda_2 = -40 + 40i$ and $\lambda_3 = -40 - 40i$, and
the exact solution of (7.97) is

$$\begin{aligned}
y_1(t) &= \tfrac{1}{2} e^{-2t} + \tfrac{1}{2} e^{-40t}(\cos 40t + \sin 40t), \\
y_2(t) &= \tfrac{1}{2} e^{-2t} - \tfrac{1}{2} e^{-40t}(\cos 40t + \sin 40t), \\
y_3(t) &= -e^{-40t}(\cos 40t - \sin 40t),
\end{aligned} \tag{7.98}$$

with graphs as in Figure 7.4. Notice that for $0 \le t \le 0.1$, the $y_i(t)$, $1 \le i \le 3$, vary
rapidly, but for $t > 0.1$ the $y_i$ vary slowly. Hence, a small time step must be
used in the interval $[0, 0.1]$ for adequate resolution, whereas for $t > 0.1$ large
time steps should suffice.

FIGURE 7.4: Exact solution for the stiff ODE system given in Example
7.10.

Suppose we use Euler's method starting at $t = 0.2$
with initial conditions taken as the exact values $y_i(0.2)$, $1 \le i \le 3$. We obtain:


For h = 0.04 


O | 0.2 | 0.3353 | 0.3350 | -0.00028 
5 | 0.3 | 0.2734 | 0.2732 | -0.000065 
10 | 0.4 | 0.2229 | 0.2228 | -0.0000054 


Violent instability occurs for $h = 0.04$ but the method is stable for $h = 0.02$.
What happened? Why do we need $h$ so small?


The answers lie in understanding absolute stability. 
7.7.1 Absolute Stability of Methods for Systems


Earlier (Definition 7.5 on page 400), we defined absolute stability of methods
for solving the IVP in terms of the scalar equation $y' = \lambda y$. We now extend
the definition to systems.


DEFINITION 7.13 Let $A$ satisfy the stated assumptions and suppose
that $\mathrm{Re}\,\lambda_i < 0$, $1 \le i \le n$. A numerical method for solving IVP (7.95) is
called absolutely stable for a particular value of the product $\lambda h$ if it yields
numerical solutions $y_j$, $j \ge 0$, in $\mathbb{R}^n$ such that $y_j \to 0$ as $j \to \infty$ for all
$y_0$. As in Definition 7.5, we speak of the region of absolute stability as being
the set of $\lambda h$ in the complex plane for which the method, applied to a scalar
equation $y' = \lambda y$, $y \in \mathbb{R}$, is absolutely stable. (The reason this will make sense
for systems will become apparent below.)


Notice that a method for a system is absolutely stable if and only if the
method is absolutely stable for the scalar equations $z' = \lambda_i z$, for $1 \le i \le n$.
To see this, consider, for example, the k-step method

$$\sum_{l=0}^{k} \alpha_l y_{j+l} = h \sum_{l=0}^{k} \beta_l f_{j+l} = h \sum_{l=0}^{k} \beta_l A y_{j+l}.$$

Thus,

$$\sum_{l=0}^{k} (\alpha_l I - h\beta_l A)\, y_{j+l} = 0, \quad j \ge 0.$$

Let $P^{-1}AP = \Lambda$, where this decomposition is guaranteed if $A$ has $n$ simple
eigenvalues, and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. We conclude that

$$\sum_{l=0}^{k} (\alpha_l I - h\beta_l \Lambda)\, P^{-1} y_{j+l} = 0.$$

Setting $z_j = P^{-1} y_j$, we see that

$$\sum_{l=0}^{k} (\alpha_l - h\beta_l \lambda_i)\,(z_{j+l})_i = 0, \quad 1 \le i \le n,$$

where $(z_{j+l})_i$ is the $i$-th component of $z_{j+l}$. Since $(z_j)_i \to 0$, $1 \le i \le n$, as
$j \to \infty$ if and only if $y_j \to 0$ as $j \to \infty$, we see that the method will be
absolutely stable for system (7.95) if and only if it is absolutely stable for
the scalar equations $z' = \lambda_i z$, $1 \le i \le n$. In this case, it will be absolutely
stable provided that the roots of $\rho(z, h\lambda_i) = \rho(z) - h\lambda_i \sigma(z)$, $1 \le i \le n$, satisfy
$|z_{l,i}| < 1$, $1 \le l \le k$, $1 \le i \le n$.
For Euler's method, we found that the region of absolute stability is the
open disk

$$\{\lambda h : |1 + \lambda h| < 1\}. \tag{7.99}$$

(See Figure 7.5.) (Recall that $y_{j+1} = y_j + \lambda h y_j$ for Euler's method applied to
$y' = \lambda y$ gives $y_j \to 0$ if $|1 + \lambda h| < 1$.)

FIGURE 7.5: The open disc in the $\lambda h$-plane representing the stability region for Euler's
method.

Applying this reasoning to Example 7.10, we see that, for the numerical
solutions to go to zero (that is, for absolute stability), we must have $|1 + \lambda_i h| <
1$, $1 \le i \le 3$. For $i = 1$ ($\lambda_1 = -2$), this yields $h < 1$. However, $i = 2, 3$
($\lambda_2 = -40 + 40i$, $\lambda_3 = -40 - 40i$) yields $h < 1/40 = 0.025$, which is violated
if $h = 0.04$. We conclude that, although the terms with eigenvalues $\lambda_2$, $\lambda_3$
contribute almost nothing to the solution of (7.97) after $t = 0.1$, they force the
selection of a small time step $h$, which must satisfy $|1 + \lambda_2 h| < 1$, $|1 + \lambda_3 h| < 1$.
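The step-size restriction $h < 1/40$ can be observed directly. The sketch below applies Euler's method to the system of Example 7.10, starting from the exact values at $t = 0.2$; it is an illustration only. The entries of `A` are taken from (7.97) with signs chosen to be consistent with the exact solution (7.98), and the end time $t = 2.2$ and all function names are assumptions made for the demonstration.

```python
import math

# Matrix of Example 7.10 (signs chosen to agree with the exact solution (7.98)).
A = [[-21.0, 19.0, -20.0],
     [19.0, -21.0, 20.0],
     [40.0, -40.0, -40.0]]


def exact(t):
    """Exact solution (7.98)."""
    e2, e40 = math.exp(-2 * t), math.exp(-40 * t)
    c, s = math.cos(40 * t), math.sin(40 * t)
    return [0.5 * e2 + 0.5 * e40 * (c + s),
            0.5 * e2 - 0.5 * e40 * (c + s),
            -e40 * (c - s)]


def euler_linear(y, h, n):
    """n Euler steps y <- y + h*A*y."""
    for _ in range(n):
        Ay = [sum(A[i][k] * y[k] for k in range(3)) for i in range(3)]
        y = [y[i] + h * Ay[i] for i in range(3)]
    return y


def max_error(h, t_start=0.2, t_end=2.2):
    """Largest componentwise error at the final time."""
    n = round((t_end - t_start) / h)
    y = euler_linear(exact(t_start), h, n)
    return max(abs(a - b) for a, b in zip(y, exact(t_start + n * h)))


for h in (0.04, 0.02):
    print(h, abs(1 + (-40 + 40j) * h), max_error(h))
```

For $h = 0.04$ the amplification factor $|1 + \lambda_2 h|$ exceeds one and the computed solution blows up, even though the stiff components of the exact solution are already negligible at $t = 0.2$; for $h = 0.02$ the factor is below one and the error stays small.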


7.7.2. Methods for Stiff Systems 
Recapitulating, we have 


DEFINITION 7.14 The linear system (7.95) is said to be stiff if $\mathrm{Re}\,\lambda_i <
0$, $1 \le i \le n$, and $\max_{1 \le i \le n} |\mathrm{Re}\,\lambda_i| \gg \min_{1 \le i \le n} |\mathrm{Re}\,\lambda_i|$, or if
$\max_{1 \le i \le n} |\mathrm{Re}\,\lambda_i|$ is large in relation to the length $b - a$ of the interval of integration.


REMARK 7.14 Note that the stiff system (7.95) is a model for nonlinear
systems $y' = f(t, y)$ with large Lipschitz constant $L$. The matrix $A$ is a model
for the Jacobi matrix $\partial f/\partial y$, i.e., expanding in a Taylor series about fixed $\hat{y}$,

$$f(t, y) \approx f(t, \hat{y}) + \frac{\partial f}{\partial y}(t, \hat{y})\,(y - \hat{y}).$$
We now study selection of methods suitable for nonlinear stiff systems.
First,


DEFINITION 7.15 A numerical method is called A-stable if its region
of stability contains all of the open left-half plane $\mathrm{Re}(\lambda h) < 0$.


REMARK 7.15 Note that, if a method is A-stable, then the numerical
solution based on that method will go to zero (and thus behave qualitatively
like the true solution) regardless of how we choose $h$.


The trapezoidal method

$$y_{j+1} = y_j + \frac{h}{2}(f_j + f_{j+1})$$

has region of absolute stability $\mathrm{Re}(\lambda h) < 0$, and therefore the trapezoidal
method is A-stable. Unfortunately, the trapezoidal method is implicit and
has order of accuracy $p = 2$. However, we have the following theorem of
Dahlquist.


THEOREM 7.7 
The order of any A-stable implicit multistep method is always $p \le 2$. Moreover,
there exists no A-stable explicit multistep method.$^{13}$


Thus, the trapezoidal method is popular for solving stiff problems, even
though the order is $p = 2$, since no multistep methods are A-stable with
$p > 2$. The trapezoidal method has the form

$$y_{j+1} = y_j + \frac{h}{2} f(t_{j+1}, y_{j+1}) + \frac{h}{2} f(t_j, y_j),$$

which is generally solved for $y_{j+1}$ by applying a few iterations (normally three
or four) of Newton's method to the nonlinear system, i.e., we
need to determine $y$ such that $F(y) = 0$, where

$$F(y) = y - y_j - \frac{h}{2} f(t_j, y_j) - \frac{h}{2} f(t_{j+1}, y).$$

We will see that Newton's method is

$$w_{l+1} = w_l - \bigl(F'(w_l)\bigr)^{-1} F(w_l)$$

where $F'(w_l)$ is the Jacobian matrix.
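For a scalar equation the Jacobian reduces to the derivative $F'(w) = 1 - \tfrac{h}{2}\,\partial f/\partial y(t_{j+1}, w)$, and the Newton solve inside each trapezoidal step can be sketched as follows. This is an illustration under assumptions: the function name, the use of an Euler step as the initial Newton iterate, and the fixed count of four iterations are choices made here, not prescriptions from the text.

```python
def trapezoid_newton(f, dfdy, t0, y0, h, n, newton_iters=4):
    """Trapezoidal method for a scalar ODE y' = f(t, y); each implicit step
    is solved by a fixed number of Newton iterations on
    F(w) = w - y_j - (h/2) f(t_j, y_j) - (h/2) f(t_{j+1}, w)."""
    t, y = t0, y0
    for _ in range(n):
        fj = f(t, y)
        w = y + h * fj                      # initial iterate: one Euler step
        for _ in range(newton_iters):
            F = w - y - h / 2 * fj - h / 2 * f(t + h, w)
            dF = 1.0 - h / 2 * dfdy(t + h, w)
            w -= F / dF
        t, y = t + h, w
    return y


# Nonlinear test: y' = -y^2, y(0) = 1, exact solution 1/(1 + t).
y_end = trapezoid_newton(lambda t, y: -y * y, lambda t, y: -2 * y, 0.0, 1.0, 0.1, 10)
print(abs(y_end - 0.5))

# Stiff linear test: y' = -1000 y with a step far outside Euler's stability disk.
print(trapezoid_newton(lambda t, y: -1000 * y, lambda t, y: -1000.0, 0.0, 1.0, 0.1, 50))
```

In the stiff test the product $\lambda h = -100$ lies far outside the stability region of Euler's method, yet the trapezoidal iterates decay, as A-stability predicts.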


$^{13}$ This is in G. Dahlquist, "A Special Stability Problem for Linear Multistep Methods,"
BIT 3 (1963), pp. 27-43.
There are other popular methods for solving nonlinear stiff systems.
Methods that are not A-stable, but that have large regions of stability, have been developed
by C. W. Gear and others; these use implicit multistep methods or implicit
Runge-Kutta methods [51]. An example of an implicit Runge-Kutta method
originally proposed by Hammer and Hollingsworth is

$$\begin{aligned}
y_{j+1} &= y_j + \gamma_1 K_1 + \gamma_2 K_2, \\
K_1 &= h f(t_j + \alpha_1 h,\; y_j + \beta_{11} K_1 + \beta_{12} K_2), \\
K_2 &= h f(t_j + \alpha_2 h,\; y_j + \beta_{21} K_1 + \beta_{22} K_2),
\end{aligned} \tag{7.100}$$

where

$$\gamma_1 = \gamma_2 = \frac{1}{2}, \quad \alpha_1 = \frac{1}{2} - \frac{\sqrt{3}}{6}, \quad \alpha_2 = \frac{1}{2} + \frac{\sqrt{3}}{6},$$

$$\beta_{11} = \beta_{22} = \frac{1}{4}, \quad \beta_{12} = \frac{1}{4} - \frac{\sqrt{3}}{6}, \quad \beta_{21} = \frac{1}{4} + \frac{\sqrt{3}}{6}.$$

This method has order $p = 4$ and requires solution of $2n$ nonlinear equations
in $2n$ unknowns per time step. The interval of absolute stability of method
(7.100) can be shown to be $(-\infty, 0)$, the same as the trapezoidal rule, but
with order twice that of the trapezoidal rule.

It is worthwhile to consider one other type of method that is suitable for
stiff systems, Padé methods. Recall that we are considering, for determination
of absolute stability, numerical approximation of

$$y' = \lambda y. \tag{7.101}$$

Notice now that the trapezoidal method for $y' = \lambda y$ can be written as

$$y_{j+1} = \left(\frac{1 + \frac{\lambda h}{2}}{1 - \frac{\lambda h}{2}}\right) y_j$$

and it is easily verified that $(1 + \frac{\lambda h}{2})/(1 - \frac{\lambda h}{2})$ is an $O((\lambda h)^3)$ approximation to
$e^{\lambda h}$. Typically, a one-step method applied to $y' = \lambda y$, $y(0) = y_0$ (whose exact
solution is $y = e^{\lambda t} y_0$) gives approximations which satisfy $y_{j+1} = R(\lambda h) y_j$,
or $y_{j+1} = (R(\lambda h))^{j+1} y_0$, where $(R(\lambda h))^j$ approximates $e^{\lambda j h}$, and hence $R(\lambda h)$
approximates $e^{\lambda h}$. Thus, $R(z)$ must be an approximation to $e^z$ for $|z|$ sufficiently
small.
Now recall Padé approximations to $e^z$. Let $R_{m,n}(z)$ be the $(m,n)$ Padé
approximation to $e^z$ of the form $p(z)/q(z)$, where $p \in P_m$, $q \in P_n$ and
$R_{m,n}^{(k)}(0) = 1$ for $k = 0, 1, 2, \ldots, m+n$. That is, the derivatives of the
rational approximation $R_{m,n}(z)$ agree with those of $e^z$ at $z = 0$ up to order
$m + n$. We saw earlier that $R_{m,n}(z)$ is a good approximation to $e^z$ and
$R_{m,n}(z) = e^z + O(z^{m+n+1})$.

Some of the first few Padé approximations to $e^z$ are given in Table 7.1. Note
that $R_{1,0}$ corresponds to Euler's method, $R_{0,1}$ corresponds to the backward
Euler method, and $R_{1,1}$ corresponds to the trapezoidal method.

TABLE 7.1: Padé approximations to $e^z$

We have the following notes on Padé methods. To this end, we first present another
definition of stability.


DEFINITION 7.16 A method is A(0)-stable if there is a $\theta \in (0, \frac{\pi}{2})$ such
that the region of absolute stability contains the infinite wedge

$$S_\theta = \{\lambda h : -\theta < \pi - \arg(\lambda h) < \theta\}.$$

(See Figure 7.6.)

FIGURE 7.6: Illustration of the region of absolute stability in the $\lambda h$-plane for A(0)-stability
(Definition 7.16).


(Notice that if a method is A-stable then it is A(0)-stable.)


REMARK 7.16 The following classical result holds for Padé methods:
The methods corresponding to $n \ge m$ in the Padé table are A(0)-stable and
the methods corresponding to $n = m$, $n = m + 1$, or $n = m + 2$ are A-stable.


REMARK 7.17 Only the Padé methods with $m \le 1$, $n \le 1$ are linear
multistep methods. All the Padé methods are single-step methods and are
explicit if $n = 0$ and are implicit if $n \ge 1$. The order of the Padé method
corresponding to values $m$ and $n$ in the Padé table is $O(h^{m+n})$.


Example 7.11 

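Consider the Padé method corresponding, for example, to $m = 1$ and $n = 2$.
We induce the corresponding Padé method by replacing $f(t, y) = \lambda y$ by the
general $f(t, y)$. Beginning with the Padé approximation to $e^{\lambda h}$, we have

$$e^{\lambda(t+h)} = e^{\lambda t} e^{\lambda h} \approx e^{\lambda t}\, \frac{6 + 2(\lambda h)}{6 - 4(\lambda h) + (\lambda h)^2},$$

whence

$$e^{\lambda(t+h)} \bigl\{6 - 4(\lambda h) + (\lambda h)^2\bigr\} \approx e^{\lambda t} \bigl\{6 + 2(\lambda h)\bigr\}.$$

Replacing $e^{\lambda t_j}$ by $y_j$, $e^{\lambda(t_j + h)}$ by $y_{j+1}$, $e^{\lambda t_j}\lambda$ by $y'(t_j) = f(t_j, y_j)$, $e^{\lambda(t_j + h)}\lambda$ by
$y'(t_{j+1}) = f(t_{j+1}, y_{j+1})$, and $e^{\lambda(t_j + h)}\lambda^2$ by

$$y''(t_{j+1}) = \tilde{f}(t_{j+1}, y_{j+1}) = \frac{\partial f}{\partial t}(t_{j+1}, y_{j+1}) + \frac{\partial f}{\partial y}(t_{j+1}, y_{j+1})\, f(t_{j+1}, y_{j+1}),$$

we obtain

$$6 y_{j+1} - 4h f(t_{j+1}, y_{j+1}) + h^2 \tilde{f}(t_{j+1}, y_{j+1}) = 6 y_j + 2h f(t_j, y_j).$$

This simplifies to

$$y_{j+1} = y_j + \frac{h}{3} f(t_j, y_j) + \frac{2h}{3} f(t_{j+1}, y_{j+1}) - \frac{h^2}{6} \tilde{f}(t_{j+1}, y_{j+1}).$$

It is straightforward to show that the local truncation error for this one-step
method is $O(h^4)$ and hence the global error is $O(h^3)$. In addition, this method
is A-stable. (Exercise 32 on page 437)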
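Both claims can be sanity-checked numerically on the rational function itself. The sketch below, offered as a plausibility check rather than a proof, evaluates $R_{1,2}(z) = (6 + 2z)/(6 - 4z + z^2)$; the function names and the sample points are hypothetical choices.

```python
import cmath


def pade_1_2(z):
    """(1,2) Pade approximation to exp(z): (6 + 2z) / (6 - 4z + z^2)."""
    return (6 + 2 * z) / (6 - 4 * z + z * z)


def approximation_error(z):
    """|exp(z) - R_{1,2}(z)|."""
    return abs(cmath.exp(z) - pade_1_2(z))


# Halving z should reduce the error by about 2^4 if the error is O(z^4).
print(approximation_error(0.1) / approximation_error(0.05))

# |R(z)| < 1 at sample points in the open left half-plane.
samples = [-1, -10, -1000, -100 + 100j, -0.01 + 5j]
print([abs(pade_1_2(z)) for z in samples])
```

An error ratio near 16 is consistent with $R_{1,2}(z) = e^z + O(z^4)$, and moduli below one at the sample points are consistent with A-stability, although sampling alone does not establish it.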


7.8 Extrapolation Methods 


Recall the Richardson extrapolation process. Suppose that $A_1(h)$ is an
approximation to a number $A$ which depends on a parameter $h$ such as step
size. Suppose that the error in the approximation satisfies

$$A - A_1(h) = c_1 h + c_2 h^2 + \cdots + c_N h^N + O(h^{N+1}) \tag{7.102}$$

for some $N \ge 1$, where $c_1, c_2, \ldots, c_N$ are constants independent of $h$. Furthermore,

$$A - A_1\!\left(\frac{h}{2}\right) = c_1 \frac{h}{2} + c_2 \left(\frac{h}{2}\right)^2 + \cdots + c_N \left(\frac{h}{2}\right)^N + O(h^{N+1}). \tag{7.103}$$

Multiplying (7.103) by 2 and subtracting (7.102), we obtain

$$A - \left(2A_1\!\left(\frac{h}{2}\right) - A_1(h)\right) = \hat{c}_2 h^2 + \hat{c}_3 h^3 + \cdots + \hat{c}_N h^N + O(h^{N+1}). \tag{7.104}$$

Thus, $2A_1(\frac{h}{2}) - A_1(h)$ should approximate $A$ better than $A_1(h)$ or $A_1(\frac{h}{2})$ for
small $h$.
Assuming an error expansion of the form (7.102), the process can be continued
$N$ times. For notational convenience, let $A_2(h) = 2A_1(\frac{h}{2}) - A_1(h)$. We
can then put the extrapolation procedure in the tabular form:

$A_1(h)$
$A_1(\frac{h}{2})$ | $A_2(h)$
$A_1(\frac{h}{4})$ | $A_2(\frac{h}{2})$ | $A_3(h)$
$A_1(\frac{h}{8})$ | $A_2(\frac{h}{4})$ | $A_3(\frac{h}{2})$ | $A_4(h)$
$A_1(\frac{h}{16})$ | $A_2(\frac{h}{8})$ | $A_3(\frac{h}{4})$ | $A_4(\frac{h}{2})$ | $A_5(h)$
Accuracy: $O(h)$ | $O(h^2)$ | $O(h^3)$ | $O(h^4)$ | $O(h^5)$

where

$$A_i(h) = \frac{2^{i-1} A_{i-1}\!\left(\frac{h}{2}\right) - A_{i-1}(h)}{2^{i-1} - 1}.$$

We now consider two numerical methods for solution of IVP (7.1) that have 
error expansion of the correct form for applying Richardson extrapolation. 
One such method is Euler’s method. (It is interesting that Euler’s method is 
a Runge-Kutta method, a Taylor series method, a one-step multistep method, 
and also a method adaptable to extrapolation.) 

Euler’s method has the form: 


\[ y_{j+1} = y_j + h f(t_j, y_j), \quad 0 \le j \le N-1, \qquad y_0 = y_a. \tag{7.105} \]

For any time $t_{j+1}$, the error has the form
\[ y_{j+1} - y(t_{j+1}) = h\,c_1(t_{j+1}) + h^2 c_2(t_{j+1}) + h^3 c_3(t_{j+1}) + \cdots \tag{7.106} \]


and thus Richardson extrapolation can be employed. 
REMARK 7.18 It is generally quite difficult to prove that the error has an
expansion of the correct form. For a comprehensive treatment of Richardson 
extrapolation and error expansions, see [56]. 


REMARK 7.19 The extrapolation procedure can be applied at each
time step to reduce the local error below a given tolerance in an error control
method. Suppose that $y_j$ is a good approximation to $y(t_j)$. To find $y_{j+1}$, the
following table is computed:

step | Euler     | Extrapolations
h/2  | w_{1,1}   |
h/4  | w_{2,1}   | w_{2,2}
h/8  | w_{3,1}   | w_{3,2}  w_{3,3}
...  | ...       | ...

where $w_{k,1}$ is the Euler approximation to $y(t_{j+1})$ using step size $h/2^k$ with the
initial value $y_j$. When $\|w_{i,i} - w_{i-1,i-1}\| < \epsilon$, then $y_{j+1} = w_{i,i}$.


Example 7.12
($n = 1$)
\[ y'(t) = t + y, \qquad y(1) = 2 \]
(exact solution $y(t) = -t - 1 + 4e^{-1}e^{t}$). Let $\epsilon = 10^{-4}$ and initial $h = 0.1$ and
$y_0 = 2$:


h Euler Extrapolations 


0.05 2.31 2.3200 
0.025 | 2.315252 | 2.3205 2.32067 
0.0125 | 2.317944 | 2.32064 2.32068 2.32068 (stop) 


Set $y_1 = 2.32068$. Now compute $y_2$ similarly.
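
The computation in Example 7.12 can be sketched in a few lines of Python. This is a minimal illustration rather than code from the text: the names `euler` and `extrapolated_step` and the `max_rows` safeguard are assumptions, and the table is started from a single Euler step of size $h$ so that the first extrapolated entry is available.

```python
def euler(f, t0, y0, h, steps):
    """Advance y' = f(t, y) from (t0, y0) by `steps` Euler steps of size h."""
    t, y = t0, y0
    for _ in range(steps):
        y += h * f(t, y)
        t += h
    return y


def extrapolated_step(f, t0, y0, h, eps, max_rows=10):
    """One error-controlled step from t0 to t0 + h (Euler + Richardson).

    Row k of the table starts with the Euler value computed with step
    h / 2**k; the remaining entries are Richardson extrapolations.
    """
    table = [[euler(f, t0, y0, h, 1)]]
    for k in range(1, max_rows):
        row = [euler(f, t0, y0, h / 2**k, 2**k)]
        for i in range(1, k + 1):
            # Odd and even powers of h are present, so the weights are 2**i.
            row.append((2**i * row[i - 1] - table[k - 1][i - 1]) / (2**i - 1))
        table.append(row)
        # Stop when two successive diagonal entries agree to within eps.
        if abs(row[-1] - table[k - 1][-1]) < eps:
            return row[-1], table
    raise RuntimeError("tolerance not reached within max_rows rows")


y1, table = extrapolated_step(lambda t, y: t + y, 1.0, 2.0, 0.1, 1e-4)
print(y1)  # close to the exact value y(1.1) = 2.3206837
```

The second and third rows of `table` begin with the Euler values 2.31 and 2.315252 listed in the example.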


Generally, recalling Romberg integration, Richardson extrapolation is much
more effective when the error expansion is of even order, i.e.,
\[ A - A_1(h) = c_2 h^2 + c_4 h^4 + c_6 h^6 + \cdots + c_{2N} h^{2N} + O(h^{2N+2}), \tag{7.107} \]
where now
\[ A_i(h) = \frac{4^{i-1} A_{i-1}(h/2) - A_{i-1}(h)}{4^{i-1} - 1}. \]
However, it is very difficult to find explicit methods with even-order error
expansions. Gragg’s method is one such method.

In Gragg’s method, Euler’s method is applied first, then the 2-step midpoint
method is applied, and finally a smoothing step is applied. Gragg’s method
has the form:
\[
\begin{array}{ll}
w_1 = w_0 + h f(t_0, w_0) & \text{Euler first step} \\
w_{i+1} = w_{i-1} + 2h f(t_i, w_i), \quad i = 1, 2, \dots, M & \text{midpoint method} \\
\hat{w}_{M+1} = \frac{1}{4} w_{M+1} + \frac{1}{2} w_M + \frac{1}{4} w_{M-1} & \text{smoothing step.}
\end{array}
\tag{7.108}
\]
Then $\hat{w}_{M+1} \approx y(t_M)$, $t_M = Mh + t_0$.


Example 7.13
\[ y'(t) = t + y, \qquad y(1) = 2. \]
(The exact solution is $y(t) = -t - 1 + 4e^{-1}e^{t}$ and $y(1.1) = 2.3206837$.) Let
$\epsilon = 10^{-4}$ and initial $h = 0.1$.

h        | \hat{w}_{M+1} ≈ y(1.1) | Extrapolations
0.1      | 2.32                   |
0.05     | 2.3205                 | 2.32067
0.025    | 2.32063766             | 2.32068  2.3206847
Accuracy | O(h^2)                 | O(h^4)   O(h^6)

Now set $y_1 = 2.32068$ and compute $y_2$ similarly.
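
The table of Example 7.13 can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the authors' code: the function names `gragg` and `gragg_extrapolation` are hypothetical, the number of midpoint steps is doubled from row to row, and the extrapolation uses the even-order formula with weights $4^{i}$.

```python
def gragg(f, t0, y0, h, M):
    """Gragg's method (7.108): approximates y(t0 + M*h)."""
    w = [y0, y0 + h * f(t0, y0)]                 # w_0 and the Euler first step w_1
    for i in range(1, M + 1):                    # two-step midpoint method
        w.append(w[i - 1] + 2 * h * f(t0 + i * h, w[i]))
    # Smoothing step combines w_{M+1}, w_M and w_{M-1}.
    return 0.25 * w[M + 1] + 0.5 * w[M] + 0.25 * w[M - 1]


def gragg_extrapolation(f, t0, y0, H, rows):
    """Extrapolation table for one step of length H, assuming an h^2 expansion."""
    table = []
    for k in range(rows):
        M = 2**k
        row = [gragg(f, t0, y0, H / M, M)]
        for i in range(1, k + 1):
            row.append((4**i * row[i - 1] - table[k - 1][i - 1]) / (4**i - 1))
        table.append(row)
    return table


for row in gragg_extrapolation(lambda t, y: t + y, 1.0, 2.0, 0.1, 3):
    print(row)
```

The first column of the printed rows corresponds to $h = 0.1$, $0.05$, and $0.025$ in the example.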


REMARK 7.20 A variation of Gragg’s method called the GBS method
(obtained by applying rational extrapolation to Gragg’s method) is one of the
most efficient general purpose methods for numerical solution of initial-value
problems [50]. In a survey in 1971, the GBS method emerged as the best
method of those tested when function evaluations were relatively inexpensive
(i.e., each function evaluation required less than 25 arithmetic operations).


REMARK 7.21 For more information on numerical methods for initial-value
problems see, for example, [37], [50], [83], [84], or [90].


7.9 Application to Parameter Estimation in Differential Equations


Many phenomena in the biological and physical sciences have been described 
by parameter-dependent systems of differential equations such as those dis- 
cussed previously in this chapter. Furthermore, some of the parameters in 
these models cannot be directly measured from observed data. Thus, param- 
eter estimation techniques discussed in this section are crucial in order to use 
such differential equation models as prediction tools. 

In this section we focus on the following question: given the set of data
$\{d_j\}_{j=1}^{n}$ at the respective points $t_j \in [0,T]$, $j = 1,\dots,n$, find the parameter
$a \in Q$, where $Q$ is a compact set contained in $C[0,T]$ (the space of continuous
functions on $[0,T]$), which minimizes the least-squares index
\[ \sum_{j=1}^{n} |y(t_j; a) - d_j|^2 \]
subject to
\[ \frac{dy}{dt} = f(t, y; a), \qquad y(0; a) = y_0, \]
where $y(t; a)$ represents the parameter-dependent solution of the above initial-value
problem. We combine two methods discussed in this book to provide
a numerical algorithm for solving this problem. In particular, we will use
approximation theory together with numerical methods for solving differential
equations to present an algorithm for solving the least-squares problem. To
this end, divide the interval $[0,T]$ into $m$ equal size intervals and denote the
bin points by $t_0, t_1, \dots, t_m$. Let $\varphi_i$ be a spline function (e.g., linear or cubic
spline) centered at $t_i$, $i = 0,\dots,m$, and define $a^m(t) = \sum_{i=0}^{m} c_i \varphi_i(t)$. Denote
by $y_k(a)$ the numerical approximation (using any of the numerical methods
discussed in this chapter) of the solution of the differential equation $y(t_k; a)$,
$k = 1,\dots,N$, with $t_k - t_{k-1} = h = T/N$. Let $y^N(t; a)$ be a piecewise interpolant
(e.g., piecewise linear) of $y_k(a)$ at the points $t_k$. Then one can define an
approximating problem of the above constrained least-squares problem as
follows: Find the parameter $a^m \in Q^m$, where $Q^m$ is the space spanned by
the $m+1$ spline elements $\varphi_0, \dots, \varphi_m$, which minimizes the least-squares index
\[ \sum_{j=1}^{n} |y^N(t_j; a^m) - d_j|^2. \]
Clearly, the above problem is a finite dimensional minimization problem and
is equivalent to the problem: Find $\{c_i\}_{i=0}^{m} \in \mathbb{R}^{m+1}$ which minimizes the
least-squares index $\sum_{j=1}^{n} |y^N(t_j; c_0, \dots, c_m) - d_j|^2$. One can apply many
optimization routines to solve this problem (e.g., the nonlinear least-squares
routine “LSQNONLIN,” available in MATLAB, works well for such a problem).
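
To make the finite-dimensional problem concrete, the following Python sketch evaluates the least-squares index for a given coefficient vector. It is only an illustration under stated assumptions: the parameter is expanded in linear (hat) splines, Euler's method is used as the ODE solver, the interpolant is piecewise linear, and all function names (`hat`, `a_m`, `solve_euler`, `interpolate`, `least_squares_index`) as well as the test model $y' = a(t)\,y$ are hypothetical. The minimization itself is left to an optimization routine.

```python
def hat(i, nodes, t):
    """Linear spline equal to 1 at nodes[i] and 0 at every other node."""
    if i > 0 and nodes[i - 1] <= t <= nodes[i]:
        return (t - nodes[i - 1]) / (nodes[i] - nodes[i - 1])
    if i < len(nodes) - 1 and nodes[i] <= t <= nodes[i + 1]:
        return (nodes[i + 1] - t) / (nodes[i + 1] - nodes[i])
    return 0.0


def a_m(c, nodes, t):
    """Spline expansion of the parameter: sum of c_i * phi_i(t)."""
    return sum(ci * hat(i, nodes, t) for i, ci in enumerate(c))


def solve_euler(f, c, nodes, y0, T, N):
    """Euler approximations y_0, ..., y_N of y' = f(t, y, a(t)) on [0, T]."""
    h = T / N
    ys = [y0]
    for k in range(N):
        ys.append(ys[-1] + h * f(k * h, ys[-1], a_m(c, nodes, k * h)))
    return ys


def interpolate(ys, T, t):
    """Piecewise linear interpolant of the grid values ys at time t."""
    N = len(ys) - 1
    h = T / N
    k = min(int(t / h), N - 1)
    return ys[k] + (t - k * h) / h * (ys[k + 1] - ys[k])


def least_squares_index(c, nodes, f, y0, T, N, data):
    """Sum of squared misfits between the interpolant and the data (t_j, d_j)."""
    ys = solve_euler(f, c, nodes, y0, T, N)
    return sum((interpolate(ys, T, tj) - dj) ** 2 for tj, dj in data)


# Hypothetical model y' = a(t) * y with nodes 0, 1, 2 and two data points.
model = lambda t, y, a: a * y
print(least_squares_index([0.0, 0.0, 0.0], [0.0, 1.0, 2.0], model,
                          4.0, 2.0, 100, [(0.5, 4.0), (1.0, 5.0)]))
```

A function such as `least_squares_index` is what would be handed, as a function of the coefficients $c_0, \dots, c_m$, to a nonlinear least-squares routine.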
7.10 Exercises


1. Prove the roundoff error bound (7.24) on page 388.


2. Suppose we consider an example of the initial value problem (7.1) (on
page 381), such that a = 0, b = 1, such that y and f are scalar valued,
and such that f(t, y(t)) = f(t), that is, f is a function of the independent
variable t only, and not of the dependent variable. In that case,
\[ y(1) = y(0) + \int_0^1 f(t)\,dt. \]


(a) To what method of approximating the integral does Euler’s method 
correspond? 


(b) In view of your answer to item 2a, do you think Euler’s method is 
appropriate to use in practice for accuracy and efficiency? 


(c) Interpret the total error (roundoff plus truncation) bound (7.24) in
this special case. (One possibility is to take the limit in (7.24) as
$L \to 0$.)


3. Show that Euler’s method fails to approximate the solution $y(x) = \left(\frac{2}{3}x\right)^{3/2}$
of the initial value problem $y'(x) = y^{1/3}$, $y(0) = 0$. Explain why.


. Show that y/(t) = §cos(2y) + t?,y(0) = 1, has a unique solution for 


lt] < 10. 


d 
. Consider the initial-value problem + = a+by(t)+csin(y(t)), O<t<1 


where y(0) = 1 and a,b,c > 0 are constants. Let us suppose that the 
solution satisfies max, ly" (t)| = M < oo. Consider the approximation 


Yer1 = Ye + (a+ bye +csin(y,))h for k =0,1,2,...,.N—1, yo =y(0), 
And hE: Prove ean a) ae 
n =>. rove a —= PAE Er 

N ire. Oey) 


6. Consider the initial-value problem $\frac{dy}{dt} = f(y,t)$, $y(0) = y_0$, for $0 \le t \le 1$.
Suppose that $|f(y,t) - f(z,t)| \le L|y - z|$ for $0 \le t \le 1$ and
$y, z \in \mathbb{R}$. Also, suppose that the solution $y(t)$ satisfies $\max_{0 \le t \le 1} |y''(t)| = M$.
Consider the numerical scheme $y_{n+1} = y_n + h f(y_n, t_n) + \epsilon_n$ for
$n = 0, 1, 2, \dots, N-1$, where $t_n = nh$, $h = 1/N$, and $y_n \approx y(t_n)$. The $\epsilon_n$
are rounding errors and $|\epsilon_n| \le \delta$ for all $n$. Prove that there are constants
$c_1, c_2 > 0$ such that $|y(1) - y_N| \le c_1 h + c_2 \frac{\delta}{h}$.
7. Consider Euler’s method for approximating the IVP $y'(x) = f(x, y)$, $0 < x < a$,
$y(0) = \alpha$. Let $y_h(x_{i+1}) = y_h(x_i) + h f(x_i, y_h(x_i))$ for $i = 0, 1, \dots, N$,
where $y_h(0) = \alpha$. It is known that $y_h(x_i) - y(x_i) = c_1 h + c_2 h^2 + c_3 h^3 + \dots$,
where $c_m$, $m = 1, 2, 3, \dots$ depend on $x_i$ but not on
$h$. Suppose that $y_h(a)$, $y_{h/2}(a)$, $y_{h/3}(a)$ have been calculated using interval
width $h$, $h/2$, $h/3$, respectively. Find an approximation $\hat{y}(a)$ to $y(a)$ that is
accurate to order $h^3$.


8. Compare the midpoint method for IVP’s (formula (7.30) on page 390)
to the midpoint rule for quadrature (formula (6.28) on page 341).


9. Duplicate the table on page 396, but for h = 0.05 and h = 0.01. (You will
probably want to write a short computer program to do this. You also
may need to display more digits than in the original table on page 396.)
By taking ratios of errors, illustrate that the global error in the order
two Taylor series method is $O(h^2)$.


10. Suppose that $y'''(t) = t + 2t y'' + 2t^2 y(t)$, $1 \le t \le 2$, $y(1) = 1$, $y'(1) = 2$,
$y''(1) = 3$. Convert this third order equation problem into a first-order
system and compute $y_k$ for $k = 1, 2$ for Euler’s method with step
length $h = 0.1$.


Consider the initial-value problem 


IO) =fUD. 0st 1-40} =F. 


Suppose 
Ww 
— _ < _— 
max |y"(O| = M <oo, [f(tu)— f(»)] $ Dlu— 
and df df 
mals a < L2lu— <t< 
poe) eee) <L*ju-v|j, O<t<1, 


for constants LD and M. Let h=1/N and 

h? df 

yi =U thftu)+ >a tay) J=01,...N. 

Prove that max, ly; — y(t;)| < CMh?, where C is a constant indepen- 
SiS 

dent of h. 

12. Consider the three stage Runge-Kutta formula
\[
\begin{aligned}
y_{k+1} &= y_k + h(\gamma_1 K_1 + \gamma_2 K_2 + \gamma_3 K_3), \\
K_1 &= f(t_k, y_k), \\
K_2 &= f(t_k + \alpha_2 h,\; y_k + h\beta_{21} K_1), \\
K_3 &= f(t_k + \alpha_3 h,\; y_k + h\beta_{31} K_1 + h\beta_{32} K_2).
\end{aligned}
\]
Determine the set of equations that the coefficients $\{\gamma_j, \alpha_j, \beta_{ji}\}$ must
satisfy if the formula is to be of order 3. Find a particular set of values
that solves these equations.


13. Calculate the real part of the region for the absolute stability of the
fourth order Runge-Kutta method (7.52).


14. Consider the Runge-Kutta method
\[ y_{i+1} = y_i + h f\!\left(t_i + \frac{h}{2},\; y_i + \frac{h}{2} f(t_i, y_i)\right). \]
Apply this method to $y' = \lambda y$ to find the interval of absolute stability
of the method. (Assume that $\lambda h < 0$.)


15. Consider numerical solution of the initial-value problem $y'(t) = f(t, y(t))$,
$0 \le t \le 1$, $y(0) = y_0 = 0$ using the trapezoidal method
\[ y_{k+1} = y_k + \frac{h}{2} \left( f(t_k, y_k) + f(t_{k+1}, y_{k+1}) \right) \]
for $k = 0, 1, \dots, N-1$, where $N = 1/h$ and $t_k = kh$. Suppose that
\[ \max_{0 \le t \le 1} |y'''(t)| \le M \quad \text{and} \quad |f(t,z) - f(t,\tilde{z})| \le L|z - \tilde{z}| \]
for all $z, \tilde{z} \in \mathbb{R}$. Assuming that $hL < 1$, prove that
\[ \max_{0 \le k \le N} |y_k - y(t_k)| \le c h^2, \]
where the constant $c$ does not depend on $h$.
16. Consider the one-step method
\[ y_0 = y(t_0), \qquad y_{j+1} = y_j + h\Phi(t_j, y_j, h), \]


where 


h 
B(t;,y;,h) = of(tious) taf (+ + 04; + Bhs(t3.15)) 


Determine $\alpha$ that will make the method consistent and determine $\beta$ that
will make the method of order of accuracy $p = 1$. Can $\beta$ be chosen so
that the order of accuracy is $p = 2$?


17. Find the region of absolute stability for


(a) Trapezoidal method:
\[ y_{j+1} = y_j + \frac{h}{2} \left( f(t_j, y_j) + f(t_{j+1}, y_{j+1}) \right), \quad j = 0, 1, \dots, N-1. \]

(b) Backward Euler method:
\[ y_{j+1} = y_j + h f(t_{j+1}, y_{j+1}), \quad j = 0, 1, \dots, N-1. \]

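18. Consider $y'(t) = f(t, y)$ for $0 \le t \le 1$, and assume $y \in C^3[0, 1]$. Let
\[ \max_{0 \le t \le 1} |y''(t)| = k_2 \quad \text{and} \quad \max_{0 \le t \le 1} |y'''(t)| = k_3. \]
Let $y_{i+1} = y_i + h\Phi(t_i, y_i, h)$ be a single-step explicit method. Suppose
that
\[ \left| y'(t) + \frac{h}{2} y''(t) - \Phi(t, y(t), h) \right| \le c_1 h^2 \]
and
\[ |\Phi(t, y, h) - \Phi(t, \tilde{y}, h)| \le M|y - \tilde{y}| \quad \text{for } y, \tilde{y} \in \mathbb{R}. \]

(a) Prove that $|y(t_i) - y_i| \le c_2 h^2$ for some constant $c_2$, for any $1 \le i \le N$.

(b) Prove that
\[ \left| y'(t_i) - \frac{y_{i+1} - y_i}{h} \right| \le c_3 h \]
for some constant $c_3$, for any $1 \le i \le N$.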

19. Consider solving the initial value problem $y' = \lambda y$, $y(0) = \alpha$, where
$\lambda < 0$, by the implicit trapezoid method, given by
\[ y_0 = \alpha, \qquad y_{i+1} = y_i + \frac{h}{2} \left[ f(t_{i+1}, y_{i+1}) + f(t_i, y_i) \right], \quad 0 \le i \le N-1, \]
$t_i = ih$, $h = T/N$. Prove that any two numerical solutions $y_i$ and $\hat{y}_i$
satisfy
\[ |y_i - \hat{y}_i| \le e^{K} |y_0 - \hat{y}_0| \]
for $0 \le t_i \le T$, assuming that $\lambda h \le 1$, where $K = 3\lambda T/2$ and $y_0$, $\hat{y}_0$
are respective initial values with $y_0 \ne \hat{y}_0$. (That is, $y_i$ and $\hat{y}_i$ satisfy the
same difference equations except for different initial values.)


20. Consider the initial-value system
\[ \frac{dy}{dt} = (I - Bt)^{-1} y, \quad y(0) = y_0, \quad y(t) \in \mathbb{R}^n, \quad 0 \le t \le 1, \]
where $B$ is an $n \times n$ matrix with $\|B\|_\infty \le 1/2$. Euler’s method for
approximating $y(t)$ has the form
\[ y_{i+1} = y_i + h(I - Bt_i)^{-1} y_i = \left( I + h(I - Bt_i)^{-1} \right) y_i, \quad i = 0, 1, \dots, N-1, \]
where $t_i = ih$ and $h = 1/N$. Noting that $\|Bt_i\|_\infty \le 1/2$ for all $i$, prove
that
\[ \|y_{i+1}\|_\infty \le (1 + 2h) \|y_i\|_\infty \]
for $i = 0, 1, \dots, N-1$ and
\[ \|y_N\|_\infty \le e^{2} \|y_0\|_\infty \]
for any value of $N \ge 1$.


21. Show that the implicit Simpson’s method (defined by Equation (7.67) 
on page 405) has order of accuracy 4. 


22. Consider the following two-step method:
\[ y_{k+2} + \alpha_1 y_{k+1} + \alpha_0 y_k = h\beta f(t_{k+2}, y_{k+2}) \]
for solving the initial-value problem $y'(t) = f(t, y)$.


(a) Find $\alpha_0$, $\alpha_1$, $\beta$ such that the method is second order.
(b) Is the method consistent? If so why and if not why not?
(c) Is the method stable? If so why and if not why not?


23. Consider the initial value problem
\[ \frac{dy}{dt} = f(t, y), \qquad y(a) = \eta \]
for the function $y(t)$ over the interval $a \le t \le b$. Consider the general
multistep method on the discrete point set defined by $t_n = a + nh$ for
$n = 0, \dots, m$ with $h = (b-a)/m$. If we write $y_n = y(t_n)$ and $f_n = f(t_n, y_n)$,
the general multistep method takes the form
\[ \sum_{j=0}^{k} \alpha_j y_{n+j} = h \sum_{j=0}^{k} \beta_j f_{n+j}. \]

(a) Assuming $\alpha_2 = 1$, construct the implicit linear two step method
containing one free parameter $\alpha_0 = c$.


(b) Find the order of this two-step method as a function of the param- 
eter c. Determine the value of c in which this two-step method has 
maximal order. 


(c) Find a value of the parameter c for which the method is explicit. 


24. Consider the problem $y'(t) = t$, $y(0) = 0$ with the exact solution $y(t) = t^2/2$.
Consider the three methods:


(i) a + yj+i — 2y; = 2hf (tj, yj) with yo = y(0) = 0, y1 = y(h) 
h2/2, 


(ii) yy — yy = 2hf (tj, yj) with yo = y(0) = 0, 
(iii) yjoi — yy = 2hf (t541, yj+i) with yo = y(0) = 0, 


where $t_j = jh$. Show that method (i) is consistent but not stable,
method (ii) is stable but not consistent, and method (iii) is stable and
consistent.

25. Determine the value of $c$ that will make the multistep method
\[ y_{j+2} = -4y_{j+1} + 5y_j + c\,h f_{j+1} + 2h f_j \]
consistent.

26. Determine whether or not the method in Exercise 25 is stable.

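27. Consider the predictor-corrector pair
\[
\begin{aligned}
y_{k+1}^{(0)} &= y_k + h f(t_k, y_k), \\
y_{k+1} &= y_k + \alpha h f(t_{k+1}, y_{k+1}^{(0)}) + (1 - \alpha) h f(t_k, y_k),
\end{aligned}
\]
where $0 \le \alpha \le 1$ is a parameter. Suppose that $\alpha = \frac{1}{2}$. Find the interval
of absolute stability of the resulting method.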


28. Develop a general-purpose routine for solving the IVP $y' = f(t,y)$,
$y(t_0) = \alpha$, using the following Adams predictor-corrector algorithm.


ALGORITHM 7.1

(A predictor-corrector scheme)
INPUT: the end points $a$ and $b$, the initial condition $\alpha$, and the number
of subintervals $N$.


OUTPUT: approximate values $\{t_i, y_i\}_{i=0}^{N}$ of the solution $y$ at the points
$t_i = a + \frac{b-a}{N} i$.


1. Given $y_0 = \alpha$, use the fourth-order Runge-Kutta method to determine
   $y_1$, $y_2$, $y_3$.
2. DO for $n = 3, \dots, N-1$
   (a) Use the four-step Adams–Bashforth Method as predictor to
       compute the approximation $y_{n+1}^{(0)}$ using
       the values $y_n$, $y_{n-1}$, $y_{n-2}$, $y_{n-3}$.
   (b) Use the three-step Adams–Moulton Method as corrector to
       compute the approximation $y_{n+1}^{(1)}$ using
       the values $y_{n+1}^{(0)}$, $y_n$, $y_{n-1}$, $y_{n-2}$.
   (c) $y_{n+1} \leftarrow y_{n+1}^{(1)}$
   END DO


END ALGORITHM 7.1.
29. Consider solving the IVP $y' = t + y$, $0 \le t \le 1$, $y(0) = 0$ using the
program developed in Exercise 28 with N = 10, 20, 40, 80, 160, 320,
640.


(a) Corresponding to each $N$, let the approximate solution computed
at the right end point be denoted by $y_1, y_2, y_3, y_4, y_5, y_6, y_7$.


(b) Compute the exact solution to this problem evaluated at the right
end point, and denote it by $y_e$.


(c) Compute the error $e_i = |y_i - y_e|$, $i = 1, \dots, 7$ corresponding to
each $N$.
(d) Compute the ratios $\alpha_i = e_{i+1}/e_i$, $i = 1, \dots, 6$.


What does $\alpha_i$ converge to and why?


30. Develop a general purpose routine to solve systems of differential equa- 
tions by suitably modifying Algorithm 7.1. Use this new program with 
N = 10, to do the following: 


(a) Solve the second-order IVP y” + 3y’ + 2y = 6e' on 0 < t < 1, 
given y(0) = 3 and y’(0) = —2. Determine the exact solution to 
this problem. Plot the exact solution and the computed solution 
over 0 < t < 1 and verify that the computed solution compares 
favorably with the exact solution. 


(b) Express the system of equations 


dz 
apa? nytel 
d2 
a fae 


with the initial conditions, 
2(0)=2'(0)=0, y(0)=1, y(0)=—-2 
as a system of first-order equations and solve it from t = 0 tot = 1. 


31. Show that, when A has simple eigenvalues, the solution to the linear
initial value problem (7.95) on page 418 is (7.96). Hint: Letting P be
the matrix whose columns are the n linearly independent eigenvectors
of A, make the transformation y = Pz, and solve the resulting linear
system in z.


32. Show that the method presented in Example 7.11 on page 426 is A- 
stable. 


33. Show that the method presented in Example 7.11 on page 426 is of order
$O(h^3)$.

34. Consider the following time-dependent logistic model for t ∈ [0, 2]:


Y — a(t 4), (0) =4. 

(a) Find parameters $c_i$ to approximate the time-varying coefficient
$a(t) \approx \sum_{i=0}^{2} c_i \varphi_i(t)$. Here, $\varphi_i$ denotes the hat function centered at
$t_i$, with respect to the nodes $[t_0, t_1, t_2] = [0, 1, 2]$. (See page 239.)
Compute those $c_i$ which provide the best least-squares fit for the
$(t, a)$ data set:


{(0.3,5), (0.6, 5.2), (0.9, 4.8), (1.2, 4.7), (1.5, 5.5), (1.8, 5.2), (2, 4.9)}. 


(b) Solve the resulting initial value problem numerically. Somehow 
estimate the error in your numerical solution. 


35. Show that the exact solutions of the difference equations (i), (ii), and 


(iii) of Exercise 24 are given respectively by: 


That is, show that these solutions satisfy the difference equations and 
the initial conditions. 


Now, compare the exact solution y(t;) = j7h?/2 to these solutions. 
In particular, consider h = 0.1, 7 = 10, and h = 0.01, 7 = 100 for 
estimating y(1) = 1/2. 


Chapter 8 


Numerical Solution of Systems of 
Nonlinear Equations 


In this chapter, we study numerical methods for solution of nonlinear systems. 
Two classic references on numerical solution of nonlinear systems are [70] and 
[73]. 

We are interested in finding $x = (x_1, x_2, \dots, x_n)^T \in D \subset \mathbb{R}^n$ that solves
\[ F(x) = 0, \tag{8.1} \]
where $F(x) = (f_1(x), f_2(x), f_3(x), \dots, f_n(x))^T$, $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$. For
example,


fi (v1, £2, £3) e712 4 x) 4+ 5agrt 
F(a) = | fo(a1, 22,23) | = | 14+ 321 + 405 + 2301 
f3(X1, £2, £3) 4+ ty — 2%9 + 4273 


Here, we consider iterative methods of solution of these nonlinear systems, 
the most general techniques. 


8.1 Introduction and Fréchet Derivatives 


In this section, we review Fréchet derivatives, a useful concept throughout 
this chapter. 


¹ Symbolic algebra methods have become more popular in recent years, and are sometimes
combined with iterative methods, as a way to preprocess systems. Also, exhaustive search
methods, such as explained in §9.6.3 of this book, can sometimes be used for small systems
to find all solutions. Continuation methods are sometimes also used to find all solutions.
However, the iterative methods of this chapter remain a basic element of most software for
finding solutions, and are the primary element when the system is large.


DEFINITION 8.1 A mapping $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is Fréchet- or F-differentiable
at an interior point $x$ of $D$ if there exists a linear mapping
$A : \mathbb{R}^n \to \mathbb{R}^n$ such that for any $h \in \mathbb{R}^n$,
\[ \lim_{h \to 0} \frac{1}{\|h\|} \|F(x + h) - F(x) - Ah\| = 0, \tag{8.2} \]
where $\|\cdot\|$ is a vector norm in $\mathbb{R}^n$.


REMARK 8.1 $A$ is an $n \times n$ matrix dependent on the point $x$, i.e., $A = A(x)$.


DEFINITION 8.2 The linear map A for which (8.2) holds is called the 
Fréchet derivative of F at x, and is denoted F'(a). 


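Now, suppose that
\[ F(x) = (f_1(x), f_2(x), f_3(x), \dots, f_n(x))^T, \]
where $f_i(x)$ has continuous first partial derivatives on $D$. Let $(F'(x))_{ij} = a_{ij}(x)$
be the $(i,j)$-th element of $A$. Since convergence in norm implies componentwise
convergence, (8.2) with $h = te_j$, where $e_j = (0, \dots, 0, 1, 0, \dots, 0)^T$ is
the $j$-th unit vector, gives
\[ \lim_{t \to 0} \frac{1}{|t|} \left| f_i(x + te_j) - f_i(x) - t\,a_{ij} \right| = 0, \quad 1 \le i \le n. \tag{8.3} \]
However,
\[ \lim_{t \to 0} \frac{1}{t} \left( f_i(x + te_j) - f_i(x) \right) = \frac{\partial f_i(x)}{\partial x_j} \quad \text{(evaluated at } x\text{)}. \]
Hence, (8.3) implies that $(F'(x))_{ij} = a_{ij}(x) = \frac{\partial f_i(x)}{\partial x_j}$, $1 \le i \le n$, $1 \le j \le n$.
Thus,
\[
F'(x) = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1}(x) & \frac{\partial f_1}{\partial x_2}(x) & \cdots & \frac{\partial f_1}{\partial x_n}(x) \\
\frac{\partial f_2}{\partial x_1}(x) & \frac{\partial f_2}{\partial x_2}(x) & \cdots & \frac{\partial f_2}{\partial x_n}(x) \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_n}{\partial x_1}(x) & \frac{\partial f_n}{\partial x_2}(x) & \cdots & \frac{\partial f_n}{\partial x_n}(x)
\end{pmatrix}. \tag{8.4}
\]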

DEFINITION 8.3 The matrix of partial derivatives in Equation (8.4) is 
called the Jacobian matrix for the function F. 
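
The limit that identifies the entries of the Jacobian matrix also suggests a simple numerical check: replace the limit by a difference quotient with a small $t$. The Python sketch below does this column by column. The function name `jacobian_fd`, the step `t = 1e-6`, and the test mapping `F_example` are assumptions chosen for illustration; they do not come from the text.

```python
import math


def jacobian_fd(F, x, t=1e-6):
    """Forward-difference approximation of the Jacobian matrix F'(x)."""
    n = len(x)
    Fx = F(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xt = list(x)
        xt[j] += t                      # x + t * e_j
        Ft = F(xt)
        for i in range(n):
            J[i][j] = (Ft[i] - Fx[i]) / t
    return J


def F_example(x):
    """Hypothetical test mapping F(x) = (x1^2 + x2, x2 * sin(x1))."""
    return [x[0] ** 2 + x[1], x[1] * math.sin(x[0])]


# The exact Jacobian at (1, 2) is [[2, 1], [2*cos(1), sin(1)]].
print(jacobian_fd(F_example, [1.0, 2.0]))
```

The forward difference has an error proportional to $t$, so the printed entries agree with the exact partial derivatives only to about five or six digits.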
Example 8.1


If 
Fey= (oy = ee) 


fo(x1, £2) 12% — 21 + 2x2 
; cos 29] — ay sin x + 2arge”? 
PROVEN GE al) Ora 
2 1%2 


Mean-value theorems as well as generalized Taylor series in many dimen- 
sions can be derived using Fréchet derivatives. A thorough theoretical treat- 
ment is presented in [104]. For example, a mean value theorem for a function 
F:R” — R” can be stated as 


then 


[ 


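THEOREM 8.1

(A multivariate mean value theorem) Suppose $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ has continuous
first-order partial derivatives, and suppose that $x \in D$, $\check{x} \in D$, and the
line segment $\{\check{x} + t(x - \check{x}) \mid t \in [0,1]\}$ is in $D$. Then
\[ F(x) = F(\check{x}) + A(x - \check{x}), \tag{8.5} \]
where $A$ is some matrix whose $i$-th row is of the form
\[ \left( \frac{\partial f_i}{\partial x_1}(c_i), \frac{\partial f_i}{\partial x_2}(c_i), \dots, \frac{\partial f_i}{\partial x_n}(c_i) \right), \]
where the $c_i \in \mathbb{R}^n$, $1 \le i \le n$ are (possibly distinct) points on the line between
$\check{x}$ and $x$.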


PROOF The proof is Exercise 1 below. 


Higher-order Taylor expansions are of the form
\[ F(x) = F(\check{x}) + F'(\check{x})(x - \check{x}) + \frac{1}{2} F''(\check{x})(x - \check{x})(x - \check{x}) + \dots, \tag{8.6} \]
where $F'$ is the Jacobian matrix as in (8.4) and $F''$, $F'''$, etc. are higher-order
derivative tensors. For example, $F''(x)$ can be viewed as a matrix of matrices,
whose $(i,j,k)$-th element is
\[ \frac{\partial^2 f_i}{\partial x_j \partial x_k}(x), \]
and where $F''(\check{x})(x - \check{x})$ can be viewed as a matrix whose $(i,j)$-th entry is
computed as
\[ \left( F''(\check{x})(x - \check{x}) \right)_{ij} = \sum_{k=1}^{n} \frac{\partial^2 f_i}{\partial x_j \partial x_k}(\check{x}) (x_k - \check{x}_k). \]
Just as in univariate Taylor expansions, if we truncate the expansion in Equation
(8.6) by taking terms only up to and including the $k$-th Fréchet derivative,
then the resulting multivariate Taylor polynomial $T_k(x)$ satisfies
\[ F(x) = T_k(x) + O(\|x - \check{x}\|^{k+1}). \]


8.2 Successive Approximation (Fixed Point Iteration) 
and the Contraction Mapping Theorem 


We now consider the successive approximation method. This is a close 
multidimensional analogue to the univariate fixed point iteration method dis- 
cussed in Section 2.3 on page 39. In turn, there are also infinite-dimensional 
analogues to this development that are useful in the analysis of differential 
equations. 

Suppose that a solution of $F(x) = 0$ is a fixed point of $G$, i.e., $x = G(x)$.
We have the iteration scheme
\[ x^{(k+1)} = G(x^{(k)}), \quad k \ge 0, \quad x^{(0)} \text{ given in } \mathbb{R}^n, \tag{8.7} \]
where
\[ G(x) = (g_1(x), g_2(x), \dots, g_n(x))^T. \tag{8.8} \]


DEFINITION 8.4 Let $G : D \subset \mathbb{R}^n \to \mathbb{R}^n$. Then $x^*$ is a point of attraction
of iteration (8.7) if there is an open neighborhood $S$ of $x^*$ such that
$S \subset D$ and, for any $x^{(0)} \in S$, the iterates $\{x^{(k)}\}$ all lie in $D$ and converge
to $x^*$.


We now begin with the contraction mapping theorem. First, 


DEFINITION 8.5 A mapping $G : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is a contraction on a
set $D_0 \subset D$ if there is an $\alpha < 1$ such that $\|G(x) - G(y)\| \le \alpha \|x - y\|$ for all
$x, y \in D_0$.


THEOREM 8.2

(Contraction Mapping Theorem) Suppose that $G : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is a contraction
on a closed set $D_0 \subset D$ and that $G : D_0 \to D_0$, i.e., if $x \in D_0$, then
$G(x) \in D_0$. Then $G$ has a unique fixed point $x^* \in D_0$. Moreover, for any
$x^{(0)} \in D_0$, the iterates $\{x^{(k)}\}$ defined by $x^{(k+1)} = G(x^{(k)})$ converge to $x^*$. We
also have the error estimates
\[ \|x^{(k)} - x^*\| \le \frac{\alpha}{1 - \alpha} \|x^{(k)} - x^{(k-1)}\|, \quad k = 1, 2, \dots \tag{8.9} \]
\[ \|x^{(k)} - x^*\| \le \frac{\alpha^k}{1 - \alpha} \|G(x^{(0)}) - x^{(0)}\|, \tag{8.10} \]
where $\alpha$ is as in Definition 8.5.


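PROOF (This is similar to the proof of Theorem 2.3 on page 40, the
univariate contraction mapping theorem.) Let $x^{(0)}$ be an arbitrary point in
$D_0$. Since $G(D_0) \subset D_0$ (where $G(D_0)$ is the set $\{G(x) \mid x \in D_0\}$), the
sequence defined by $x^{(k+1)} = G(x^{(k)})$ is well defined and lies in $D_0$. By
Definition 8.5,
\[ \|x^{(k+1)} - x^{(k)}\| = \|G(x^{(k)}) - G(x^{(k-1)})\| \le \alpha \|x^{(k)} - x^{(k-1)}\| \quad \text{for } k \ge 1. \tag{8.11} \]
Repeated application of (8.11) yields
\[
\begin{aligned}
\|x^{(k+p)} - x^{(k)}\| &\le \sum_{i=1}^{p} \|x^{(k+i)} - x^{(k+i-1)}\| \\
&\le (\alpha^{p-1} + \alpha^{p-2} + \dots + 1) \|x^{(k+1)} - x^{(k)}\| \\
&\le \frac{\alpha}{1 - \alpha} \|x^{(k)} - x^{(k-1)}\| \le \frac{\alpha^k}{1 - \alpha} \|x^{(1)} - x^{(0)}\|.
\end{aligned}
\tag{8.12}
\]
Thus, by (8.12), $\{x^{(k)}\}$ is a Cauchy sequence (that is, for each $\epsilon > 0$ there is an $n_0$ such that
$\|x^{(m)} - x^{(n)}\| < \epsilon$ whenever $m, n \ge n_0$) in the closed set $D_0 \subset \mathbb{R}^n$, with respect
to any norm. Therefore, since Cauchy sequences in $\mathbb{R}^n$ converge to elements
in $\mathbb{R}^n$,
\[ \lim_{k \to \infty} x^{(k)} = x^* \quad \text{and} \quad x^* \in D_0. \tag{8.13} \]
In addition,
\[
\begin{aligned}
\|x^* - G(x^*)\| &= \|x^* - x^{(k+1)} + G(x^{(k)}) - G(x^*)\| \\
&\le \|x^* - x^{(k+1)}\| + \|G(x^{(k)}) - G(x^*)\| \\
&\le \|x^* - x^{(k+1)}\| + \alpha \|x^{(k)} - x^*\|.
\end{aligned}
\]
Thus, since $G$ is continuous, $x^* = G(x^*)$, i.e., $x^*$ is a fixed point of $G$. Furthermore,
the limit is unique, since if there were two fixed points $x_1^* \ne x_2^*$ in
$D_0$, then $\|x_1^* - x_2^*\| = \|G(x_1^*) - G(x_2^*)\| \le \alpha \|x_1^* - x_2^*\| < \|x_1^* - x_2^*\|$, which is a
contradiction. Finally, error estimates (8.9) and (8.10) follow from (8.12) by
letting $p \to \infty$ and observing that $x^{(1)} = G(x^{(0)})$.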


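Example 8.2
Consider
\[
\begin{aligned}
3x_1 - \cos(x_1 x_2) - \tfrac{1}{2} &= 0, \quad \text{or } x_1 = \tfrac{1}{3} \cos(x_1 x_2) + \tfrac{1}{6}, \\
20x_2 + e^{-x_1 x_2} + 8 &= 0, \quad \text{or } x_2 = -\tfrac{1}{20} e^{-x_1 x_2} - \tfrac{2}{5}.
\end{aligned}
\]
Thus, the related systems $F(x) = 0$ and $G(x) = x$ can be written
\[ F(x) = \left( 3x_1 - \cos(x_1 x_2) - \tfrac{1}{2},\; 20x_2 + e^{-x_1 x_2} + 8 \right)^T \]
and
\[ G(x) = (g_1(x), g_2(x))^T = \left( \tfrac{1}{3} \cos(x_1 x_2) + \tfrac{1}{6},\; -\tfrac{1}{20} e^{-x_1 x_2} - \tfrac{2}{5} \right)^T. \]
Let $D_0 = \{(x_1, x_2) \in \mathbb{R}^2 : -1 \le x_i \le 1 \text{ for } i = 1, 2\}$. We will use Theorem 8.2
to show that the iterations $x^{(k+1)} = G(x^{(k)})$ converge to a unique fixed point
$x^* \in D_0$. Consider
\[
\begin{aligned}
|g_1(x_1, x_2)| &= \left| \tfrac{1}{3} \cos(x_1 x_2) + \tfrac{1}{6} \right| \le \tfrac{1}{3} + \tfrac{1}{6} = \tfrac{1}{2}, \\
|g_2(x_1, x_2)| &= \left| \tfrac{1}{20} e^{-x_1 x_2} + \tfrac{2}{5} \right| \le \tfrac{e}{20} + \tfrac{2}{5} < 1 \quad \text{for } -1 \le x_1, x_2 \le 1.
\end{aligned}
\]
Thus, $|g_i(x)| \le 1$ for $i = 1, 2$ and hence $-1 \le g_i(x) \le 1$ for $i = 1, 2$ if
$x \in D_0$. (In other words, $G(x) \in D_0$ whenever $x \in D_0$.) Now consider
showing $G$ is a contraction on $D_0$. We need to show that there is an $\alpha < 1$
such that $\|G(x) - G(y)\| \le \alpha \|x - y\|$ for all $x, y \in D_0$.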


In Example 8.2, showing that G is a contraction can be facilitated by The- 
orem 8.3 below. To present Theorem 8.3, we first review the following defini- 
tion. 


DEFINITION 8.6 A set $D_0$ is said to be convex, provided
$\lambda x + (1 - \lambda) y \in D_0$ whenever $x \in D_0$, $y \in D_0$, and $\lambda \in [0,1]$.

THEOREM 8.3
Let $D_0$ be a convex subset of $\mathbb{R}^n$ and $G$ be a mapping of $D_0$ into $\mathbb{R}^n$ whose
components $g_1, g_2, \dots, g_n$ have continuous and bounded derivatives of first
order on $D_0$. Then the mapping $G$ satisfies the Lipschitz condition
\[ \|G(x) - G(y)\| \le L \|x - y\| \quad \text{for all } x, y \in D_0, \tag{8.14} \]
where $L = \sup_{w \in D_0} \|G'(w)\|$, where $G'$ is the Jacobian matrix of $G$ (i.e., $G'$ is
the Fréchet derivative of $G$). If $L \le \alpha < 1$, then $G$ is a contraction on $D_0$.
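REMARK 8.2 $\|\cdot\|$ signifies a vector norm and the corresponding induced
matrix norm, i.e., $\|A\| = \sup_{\|x\| \ne 0} \frac{\|Ax\|}{\|x\|}$. Thus, $\|Ax\| \le \|A\| \|x\|$.


PROOF Since $D_0$ is convex, $x + s(y - x) \in D_0$ for $0 \le s \le 1$ for any
$y, x \in D_0$. Let
\[ \Phi_j(s) = g_j(x + s(y - x)), \quad 0 \le s \le 1, \quad j = 1, 2, \dots, n. \tag{8.15} \]
Observe that $\Phi_j$ is a continuously differentiable function of $s$ on $[0,1]$, because
\[ \frac{d\Phi_j}{ds} = \frac{\partial g_j}{\partial x_1} \frac{\partial x_1}{\partial s} + \dots + \frac{\partial g_j}{\partial x_n} \frac{\partial x_n}{\partial s} = \frac{\partial g_j}{\partial x_1} (y_1 - x_1) + \dots + \frac{\partial g_j}{\partial x_n} (y_n - x_n). \]
With these functions $\Phi_j$,
\[ g_j(y) - g_j(x) = \Phi_j(1) - \Phi_j(0) = \int_0^1 \Phi_j'(s)\,ds \tag{8.16} \]
and
\[ \Phi_j'(s) = \sum_{k=1}^{n} \frac{\partial g_j(w)}{\partial x_k} (y_k - x_k), \tag{8.17} \]
where $w = x + s(y - x)$.
Let $\Phi(s) = (\Phi_1(s), \Phi_2(s), \dots, \Phi_n(s))^T$. Then (8.16) and (8.17) give
\[ G(y) - G(x) = \int_0^1 \Phi'(s)\,ds \tag{8.18} \]
and
\[ \Phi'(s) = G'(w)(y - x), \tag{8.19} \]
where $G'(w)$ is the Jacobian matrix of $G$ evaluated at $w = x + s(y - x)$.
Thus,
\[ \|G(y) - G(x)\| = \left\| \int_0^1 \Phi'(s)\,ds \right\| \le \sup_{0 \le s \le 1} \|\Phi'(s)\|. \tag{8.20} \]
To see the above inequality, note that
\[ \int_0^1 \Phi'(s)\,ds = \lim_{N \to \infty} \sum_{j=1}^{N} \Phi'\!\left(\frac{j}{N}\right) \frac{1}{N}. \]
But
\[ \left\| \sum_{j=1}^{N} \Phi'\!\left(\frac{j}{N}\right) \frac{1}{N} \right\| \le \sum_{j=1}^{N} \left\| \Phi'\!\left(\frac{j}{N}\right) \right\| \frac{1}{N} \le \sup_{0 \le s \le 1} \|\Phi'(s)\|. \]
Hence,
\[ \left\| \int_0^1 \Phi'(s)\,ds \right\| = \lim_{N \to \infty} \left\| \sum_{j=1}^{N} \Phi'\!\left(\frac{j}{N}\right) \frac{1}{N} \right\| \le \sup_{0 \le s \le 1} \|\Phi'(s)\|. \]
Inequality (8.20) and Equation (8.19) now give
\[ \|G(y) - G(x)\| \le \sup_{0 \le s \le 1} \|\Phi'(s)\| \le \sup_{w \in D_0} \|G'(w)\| \|y - x\|. \]
Hence,
\[ \|G(y) - G(x)\| \le L \|y - x\|, \quad \text{where } L = \sup_{w \in D_0} \|G'(w)\|. \]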


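Now let’s apply this theorem to Example 8.2. We have
\[
G'(x) = \begin{pmatrix}
\frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} \\
\frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2}
\end{pmatrix}
= \begin{pmatrix}
-\frac{1}{3} x_2 \sin(x_1 x_2) & -\frac{1}{3} x_1 \sin(x_1 x_2) \\
\frac{x_2}{20} e^{-x_1 x_2} & \frac{x_1}{20} e^{-x_1 x_2}
\end{pmatrix},
\]
and
\[ \left| \frac{\partial g_1}{\partial x_1} \right| \le \frac{1}{3}, \quad \left| \frac{\partial g_1}{\partial x_2} \right| \le \frac{1}{3}, \quad \left| \frac{\partial g_2}{\partial x_1} \right| \le \frac{e}{20}, \quad \left| \frac{\partial g_2}{\partial x_2} \right| \le \frac{e}{20} \]
for $x \in D_0 = [-1,1] \times [-1,1]$. Thus,
\[ \|G'(x)\|_\infty \le \frac{2}{3} = L. \]
Hence, $G$ is a contraction on $D_0$, $G(D_0) \subset D_0$, and Theorem 8.2 implies that $G$
has a unique fixed point in $D_0$ and the iterations defined by $x^{(k+1)} = G(x^{(k)})$
converge to this unique fixed point.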
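
The convergence claimed for Example 8.2 can be observed numerically. The following Python sketch iterates $x^{(k+1)} = G(x^{(k)})$ with $g_1(x) = \frac{1}{3}\cos(x_1 x_2) + \frac{1}{6}$ and $g_2(x) = -\frac{1}{20} e^{-x_1 x_2} - \frac{2}{5}$, and stops using the a posteriori estimate (8.9) in the $\infty$-norm with $\alpha = 2/3$. The function name `fixed_point`, the tolerance, and the starting point are assumptions made for illustration.

```python
import math


def G(x):
    """Fixed-point map of Example 8.2."""
    x1, x2 = x
    return (math.cos(x1 * x2) / 3 + 1 / 6, -math.exp(-x1 * x2) / 20 - 2 / 5)


def fixed_point(G, x0, alpha, tol=1e-10, max_iter=200):
    """Iterate x <- G(x) until the bound (8.9) guarantees an error <= tol."""
    x = x0
    for k in range(1, max_iter + 1):
        x_new = G(x)
        step = max(abs(a - b) for a, b in zip(x_new, x))   # infinity norm
        x = x_new
        if alpha / (1 - alpha) * step <= tol:
            return x, k
    raise RuntimeError("no convergence within max_iter iterations")


x_star, iterations = fixed_point(G, (0.0, 0.0), alpha=2 / 3)
print(x_star, iterations)
```

Because $2/3$ is only an upper bound on the contraction constant, the stopping test is conservative: the actual error at termination is typically much smaller than `tol`.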

We also have 


THEOREM 8.4

Let $x^*$ be a fixed point of $G(x)$, and assume the components of $G(x)$ are
continuously differentiable in some neighborhood of $x^*$. Furthermore, assume
that $\|G'(x^*)\| < 1$, where $\|\cdot\|$ is some vector norm and corresponding induced
matrix norm. Then, for $x^{(0)}$ chosen sufficiently close to $x^*$, $x^{(k+1)} = G(x^{(k)})$
will converge to $x^*$, and the results of Theorem 8.2 will be valid on some
closed, bounded, convex region about $x^*$.


PROOF Pick a number $\lambda$ satisfying $\|G'(x^*)\| < \lambda < 1$. Then, choose a
set
\[ D_0 = \{x \in \mathbb{R}^n : \|x - x^*\| \le \epsilon\}, \]
with $L = \max_{x \in D_0} \|G'(x)\| \le \lambda < 1$. We have $G(D_0) \subset D_0$, since $\|x^* - x\| \le \epsilon$
implies that
\[ \|x^* - G(x)\| = \|G(x^*) - G(x)\| \le L \|x^* - x\| \le \lambda \|x^* - x\| < \epsilon. \]
Thus, $G$ maps $D_0$ into $D_0$ and is a contraction on $D_0$, so Theorem 8.2 can be
applied.


8.3 Newton’s Method and Variations


We specialize the results of the previous section to iterations of the form
\[ x^{(k+1)} = x^{(k)} - \left( A(x^{(k)}) \right)^{-1} F(x^{(k)}), \tag{8.21} \]
\[ G(x) = x - A^{-1}(x) F(x), \]
where $A(x)$ is an $n \times n$ matrix whose elements are functions of $x$. Assuming
$A(x)$ is nonsingular, $x = G(x)$ (i.e., $x$ is a fixed point of $G$) if and only if
$F(x) = 0$.


REMARK 8.3 Equation (8.21) is equivalent to
\[ x^{(k+1)} = x^{(k)} + v^{(k)}, \quad \text{where } v^{(k)} \text{ solves } A(x^{(k)}) v^{(k)} = -F(x^{(k)}), \tag{8.22} \]
where (8.22) is how the iteration is implemented in practice.

REMARK 8.4 If $A(x) = F'(x)$, then (8.21) is called Newton’s method.
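
As an illustration of the linear-solve form of the iteration with $A(x) = F'(x)$, the following Python sketch applies Newton’s method to the system of Example 8.2, read here as $3x_1 - \cos(x_1 x_2) - \frac{1}{2} = 0$, $20x_2 + e^{-x_1 x_2} + 8 = 0$. It is a minimal sketch under stated assumptions: the names `solve2` and `newton`, the stopping rule, and the starting point are hypothetical, and the $2 \times 2$ linear system is solved by Cramer’s rule purely to keep the example self-contained.

```python
import math


def F(x):
    """Residual of the system in Example 8.2."""
    x1, x2 = x
    return [3 * x1 - math.cos(x1 * x2) - 0.5,
            20 * x2 + math.exp(-x1 * x2) + 8]


def J(x):
    """Jacobian matrix F'(x), differentiated by hand."""
    x1, x2 = x
    s, e = math.sin(x1 * x2), math.exp(-x1 * x2)
    return [[3 + x2 * s, x1 * s],
            [-x2 * e, 20 - x1 * e]]


def solve2(A, b):
    """Solve the 2x2 linear system A v = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def newton(F, J, x0, tol=1e-12, max_iter=50):
    """Newton's method: solve F'(x) v = -F(x), then update x <- x + v."""
    x = list(x0)
    for k in range(1, max_iter + 1):
        v = solve2(J(x), [-fi for fi in F(x)])
        x = [xi + vi for xi, vi in zip(x, v)]
        if max(abs(vi) for vi in v) <= tol:
            return x, k
    raise RuntimeError("no convergence within max_iter iterations")


root, iterations = newton(F, J, [0.0, 0.0])
print(root, iterations)
```

For larger systems the call to `solve2` would be replaced by a general linear solver; the inverse of $F'(x)$ is never formed explicitly.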


8.3.1 Some Local Convergence Results for Newton’s Method 


We will derive a local convergence result for Newton’s method. First, we 
consider several preliminary results. 


LEMMA 8.1
Let $A$ be a nonsingular matrix. Suppose that $B$ is a matrix such that $\|A^{-1}\| \|B\| < 1$.
Then, $A + B$ is nonsingular, and
\[ \|(A + B)^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \|B\|}. \tag{8.23} \]


PROOF $A + B = A(I + A^{-1}B)$, but $\|A^{-1}B\| \le \|A^{-1}\| \|B\| < 1$. Thus,
$\rho(A^{-1}B) < 1$, so $A(I + A^{-1}B)$ is nonsingular, from which it follows that $A + B$
is nonsingular. Also, $(A + B)^{-1} = (I + A^{-1}B)^{-1} A^{-1}$, so
\[ \|(A + B)^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \|B\|}. \]
(Recall if $\|C\| < 1$, then $(I - C)^{-1} = I + C + C^2 + \cdots$, so
\[ \|(I - C)^{-1}\| \le 1 + \|C\| + \|C\|^2 + \cdots = \frac{1}{1 - \|C\|}.) \]

THEOREM 8.5

Suppose that $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is F-differentiable at a point $x^*$ in the
interior of $D$ at which $F(x^*) = 0$. Let $A : S_0 \to \mathbb{R}^n$ be a linear mapping (i.e.,
$A$ is an $n \times n$ matrix), let $A$ be continuous at $x^*$, and let $A(x^*)$ be nonsingular.
Then, there exists a closed ball $S(x^*, \delta) \subset S_0$, $S = \{x \in S_0 \subset D : \|x - x^*\| \le \delta\}$,
$\delta > 0$, on which the mapping $G : S \to \mathbb{R}^n$, $G(x) = x - A^{-1}(x) F(x)$ is
well-defined. Moreover, $G$ is F-differentiable at $x^*$ and
\[ G'(x^*) = I - (A(x^*))^{-1} F'(x^*). \]
(See the following figure.)


PROOF G will be well-defined on S if A is nonsingular on S. Let 6 = 
|| A~!(x*)|| and let « > 0 be such that 0 < € < aa By hypothesis, A(x) is 
continuous at 2*. Therefore, there is a 6 = 6(x*,€) > Osuch that S(x,6) C So 
and 


| A(z) — A(a*)||<e forall xeS. (8.24) 


We now use Lemma 8.1 with A = A(x) and B = —A(a*)+ A(x), ie, A+ B= 
A(x). Since 


|A7*(a*)|I||A(@) — A(2*)|| < €6 e ial ees. 
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A(z) is invertible for all x € S. Moreover, 
||(A(a*) + A(z) — A(x*))~* | 


|A~* (2) 
1 ||A~*(@*)IIA@) — A@*)II 


|A~*(2)| 


I 


(by Lemma 8.1) 


IA 


=268 for ze 58. (8.25) 
Thus, $G$ is well-defined for all $x \in S$. We now show that $G$ is F-differentiable
at $x^*$. First, since $x^*$ is a solution of $F(x) = 0$, $x^*$ is a fixed point of $G$, i.e.,

$$G(x^*) = x^*. \tag{8.26}$$

Also, since $F$ is F-differentiable at $x^*$, we have

$$\frac{\|F(x) - F(x^*) - F'(x^*)(x - x^*)\|}{\|x - x^*\|} \to 0 \quad \text{as } \|x - x^*\| \to 0,$$

so, by choosing $\delta$, the radius of $\bar{S}(x^*, \delta)$, sufficiently small, we conclude that

$$\|F(x) - F(x^*) - F'(x^*)(x - x^*)\| \le \epsilon\|x - x^*\| \quad \text{for all } x \in S. \tag{8.27}$$


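Now, for $x \in S$,

$$\begin{aligned}
\|G(x) - G(x^*) &- (I - A^{-1}(x^*)F'(x^*))(x - x^*)\| \\
&= \|A^{-1}(x^*)F'(x^*)(x - x^*) - A^{-1}(x)F(x)\| \\
&\le \|A^{-1}(x)[F(x) - F(x^*) - F'(x^*)(x - x^*)]\| \\
&\quad + \|A^{-1}(x)(A(x^*) - A(x))A^{-1}(x^*)F'(x^*)(x - x^*)\| \quad \text{since } F(x^*) = 0 \\
&\le 2\beta\epsilon\|x - x^*\| + 4\beta^2\epsilon\|F'(x^*)\|\,\|x - x^*\| \quad \text{using (8.24), (8.25), and (8.27)} \\
&\le \epsilon C\|x - x^*\|, \quad \text{where } C = 2\beta + 4\beta^2\|F'(x^*)\| \text{ is a constant.}
\end{aligned}$$

The above computation, combined with the definition of the Fréchet derivative
(Definition 8.1), shows that $G$ is F-differentiable at $x^*$ and
$G'(x^*) = I - A^{-1}(x^*)F'(x^*)$.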


Before proving a local convergence theorem for iteration (8.21), we introduce
the following lemma. A proof of this lemma can be found, for example,
in [70].


LEMMA 8.2 

(Ostrowski) Assume that $G : D \subset \mathbb{R}^n \to \mathbb{R}^n$ has a fixed point $x^*$ in the
interior of $D$ and that $G$ has a Fréchet derivative $G'$ at $x^*$. Then, if the
spectral radius of $G'(x^*)$ satisfies $\rho(G'(x^*)) = \sigma < 1$, it follows that $x^*$ is a
point of attraction of the iterates $x^{(k+1)} = G(x^{(k)})$.


We now have an attraction result for iteration (8.21).


COROLLARY 8.1 
(to Theorem 8.5) Assume that the hypotheses of Theorem 8.5 hold. In addition,
suppose that

$$\rho(G'(x^*)) = \rho(I - A^{-1}(x^*)F'(x^*)) = \sigma < 1. \tag{8.28}$$

Then $x^*$ is a point of attraction of iteration (8.21).


PROOF Lemma 8.2 and Theorem 8.5 imply that $x^*$ is a point of attraction
of the iteration

$$x^{(k+1)} = x^{(k)} - A^{-1}(x^{(k)})F(x^{(k)}) \quad \text{for } k = 0, 1, 2, \ldots.$$


REMARK 8.5 A special case, in which $\rho(G'(x^*)) = 0$, is when $A(x) = F'(x)$.
This corresponds to the iteration

$$x^{(k+1)} = x^{(k)} - (F'(x^{(k)}))^{-1}F(x^{(k)}), \tag{8.29}$$

which is Newton’s method.


Theorem 8.5 leads to the following local convergence result for Newton’s 
method. 


THEOREM 8.6 

Assume that $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ is Fréchet differentiable on an open
neighborhood $S_0 \subset D$ of a point $x^* \in D$ for which $F(x^*) = 0$. Also, assume that
$F'(x)$ is continuous at $x^*$ and that $F'(x^*)$ is nonsingular. Then $x^*$ is a point
of attraction of Newton’s method (8.29).


PROOF By Theorem 8.5, with $A(x) = F'(x)$ for $x \in S_0$, we conclude that
$G(x) = x - (F'(x))^{-1}F(x)$ is well-defined on some ball $\bar{S}(x^*, \delta) \subset S_0$, $\delta > 0$.
In addition, $\rho(G'(x^*)) = \sigma = 0$. Therefore, by Corollary 8.1, $x^*$ is a point of
attraction.


8.3.2 Convergence Rate of Newton’s Method 


We now examine the rate of convergence of Newton iteration. 


PROPOSITION 8.1 
Assume that the hypotheses of Theorem 8.6 hold. Then, for the point of
attraction $x^*$ of the Newton iteration (whose existence is guaranteed by
Theorem 8.6), we have

$$\lim_{k \to \infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|} = 0. \tag{8.30}$$

Moreover, if for some constant $\hat{c}$,

$$\|F'(x) - F'(x^*)\| \le \hat{c}\,\|x - x^*\| \quad \text{(that is, } F' \text{ is Lipschitz continuous)} \tag{8.31}$$

for all $x$ in some neighborhood of $x^*$, then there exists a positive constant $c$
such that

$$\|x^{(k+1)} - x^*\| \le c\,\|x^{(k)} - x^*\|^2. \tag{8.32}$$


REMARK 8.6 If $\|x^{(k+1)} - x^*\| / \|x^{(k)} - x^*\| \le \alpha$ for some constant $\alpha < 1$ and
all $k$ sufficiently large, the convergence is said to be linear. Equation (8.30)
indicates that Newton’s method has superlinear convergence, and if (8.31) is
satisfied, then Newton’s method is quadratically convergent near $x^*$.


PROOF Recall that the fixed point iteration function is
$G(x) = x - (F'(x))^{-1}F(x)$ for Newton’s method. In Theorem 8.5, $G$ was shown to be
well-defined in some ball about $x^*$ and the F-derivative of $G$ was shown to
exist at $x^*$. Then, for $x^{(k)}$ in the ball of attraction, $x^{(k+1)} = G(x^{(k)})$ implies

$$\lim_{k \to \infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|}
= \lim_{k \to \infty} \frac{\|G(x^{(k)}) - G(x^*) - G'(x^*)(x^{(k)} - x^*)\|}{\|x^{(k)} - x^*\|} = 0.$$

This follows from the fact that $G'(x^*) = I - (F'(x^*))^{-1}F'(x^*) = 0$ and from the
definition of Fréchet differentiability. Thus, (8.30) is valid.

Now, let $k_0$ be such that (8.31) holds in a ball containing $\{x^{(k)}\}_{k \ge k_0}$. For
any such $k \ge k_0$, consider the convex set consisting of points between $x^{(k)}$
and $x^*$. Using (8.31) and Lemma 8.3 (given following this proof), we obtain:

$$\|F(x^{(k)}) - F(x^*) - F'(x^*)(x^{(k)} - x^*)\| \le \frac{\hat{c}}{2}\,\|x^{(k)} - x^*\|^2. \tag{8.33}$$

Now,

$$\begin{aligned}
\|x^{(k+1)} - x^*\| &= \|G(x^{(k)}) - x^*\| = \|x^{(k)} - (F'(x^{(k)}))^{-1}F(x^{(k)}) - x^*\| \\
&\le \|(F'(x^{(k)}))^{-1}\{F(x^{(k)}) - F(x^*) - F'(x^*)(x^{(k)} - x^*)\}\| \\
&\quad + \|(F'(x^{(k)}))^{-1}\{(F'(x^{(k)}) - F'(x^*))(x^{(k)} - x^*)\}\| \\
&\le \|(F'(x^{(k)}))^{-1}\|\left(\frac{\hat{c}}{2} + \hat{c}\right)\|x^{(k)} - x^*\|^2
\end{aligned}$$

by (8.31) and (8.33) and because $F(x^*) = 0$. Thus, with $A = F'$, the
hypotheses of Theorem 8.5 are satisfied, so (8.25) (with $k$ sufficiently large)
implies that

$$\|(F'(x^{(k)}))^{-1}\| \le 2\|(F'(x^*))^{-1}\| = \text{a constant}.$$

This, in turn, implies (8.32).
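
The quadratic estimate (8.32) can be observed numerically. The following is a minimal sketch, not part of the original development: the scalar test problem $f(x) = x^2 - 2$, the starting point, and the function name `newton_errors` are illustrative assumptions. For this problem the ratios $\|x^{(k+1)} - x^*\| / \|x^{(k)} - x^*\|^2$ remain bounded, while the errors themselves shrink rapidly.

```python
import math

def newton_errors(f, fprime, x0, root, steps):
    """Run scalar Newton steps and return the errors |x_k - root|."""
    errors = [abs(x0 - root)]
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)      # Newton update (8.29) with n = 1
        errors.append(abs(x - root))
    return errors

# Illustrative problem (an assumption, not from the text): f(x) = x^2 - 2.
errors = newton_errors(lambda x: x * x - 2.0, lambda x: 2.0 * x,
                       x0=2.0, root=math.sqrt(2.0), steps=4)

# Ratios e_{k+1} / e_k^2; boundedness of these is what (8.32) asserts.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(3)]
```

Only the first few ratios are formed, because once the error reaches roundoff level the quotient is no longer meaningful.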


LEMMA 8.3 
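Let $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ be continuously Fréchet differentiable on a convex set
$D_0 \subset D$ and suppose that, for some constant $\alpha > 0$, $F'$ satisfies

$$\|F'(u) - F'(v)\| \le \alpha\|u - v\| \quad \text{for } u, v \in D_0. \tag{8.34}$$

Then, for any $x, y \in D_0$,

$$\|F(y) - F(x) - F'(x)(y - x)\| \le \frac{\alpha}{2}\,\|x - y\|^2.$$


PROOF We showed in the proof of Theorem 8.3 (on page 444) that

$$F(y) - F(x) = \int_0^1 F'(x + s(y - x))(y - x)\,ds.$$

Thus,

$$\begin{aligned}
\|F(y) - F(x) - F'(x)(y - x)\| &= \left\|\int_0^1 \left[F'(x + s(y - x)) - F'(x)\right](y - x)\,ds\right\| \\
&\le \int_0^1 \|F'(x + s(y - x)) - F'(x)\|\,\|y - x\|\,ds \\
&\le \alpha \int_0^1 s\,\|y - x\|^2\,ds = \frac{\alpha}{2}\,\|y - x\|^2.
\end{aligned}$$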


8.3.3. Example of Newton’s Method 


Here, we see an iterative method for computing the inverse of a nonsingular
matrix. Let $A$ be an $n \times n$ nonsingular matrix, and view $A$ as an $n^2$-vector
(for example, by identifying the first row with the first $n$ components, the
second row with the second $n$ components, etc.). Similarly, let $x$ be an $n \times n$
matrix, use the notation $x^{-1}$ to denote the inverse of $x$, and view $x$ and $x^{-1}$
as $n^2$-vectors. With this notation, define

$$F(x) = x^{-1} - A.$$

(Then $F(A^{-1}) = 0$, so the solution to $F(x) = 0$ is $x^* = A^{-1}$.)
What is $F'(x)$? To find $F'(x)$, we calculate


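$$\lim_{\|y\| \to 0} \frac{\|F(x + y) - F(x) - F'(x)y\|}{\|y\|} = 0.$$

Proceeding with these computations, we obtain

$$\begin{aligned}
\lim_{\|y\| \to 0} \frac{\|F(x + y) - F(x) - F'(x)y\|}{\|y\|}
&= \lim_{\|y\| \to 0} \frac{\|(x + y)^{-1} - A - x^{-1} + A - F'(x)y\|}{\|y\|} \\
&= \lim_{\|y\| \to 0} \frac{\|(x + y)^{-1} - x^{-1} - F'(x)y\|}{\|y\|}.
\end{aligned} \tag{8.35}$$

Now, observe that

$$(x + y) = x(I + x^{-1}y), \quad \text{so} \quad (x + y)^{-1} = (I + x^{-1}y)^{-1}x^{-1}.$$

Also, recall that

$$(I + B)^{-1} = I - B + B^2 - B^3 + \cdots \quad \text{if } \rho(B) < 1$$

(see Theorem 3.5 on page 102 and Proposition 3.8 on page 103), so

$$(I + x^{-1}y)^{-1} = I - x^{-1}y + (x^{-1}y)^2 - (x^{-1}y)^3 + \cdots \quad \text{if } \rho(x^{-1}y) < 1, \tag{8.36}$$

where $\rho(x^{-1}y) < 1$ whenever $\|y\|$ is small enough. Therefore, the limit in
(8.35) is equal to

$$\begin{aligned}
&\lim_{\|y\| \to 0} \frac{\|(I + x^{-1}y)^{-1}x^{-1} - x^{-1} - F'(x)y\|}{\|y\|} \\
&\quad = \lim_{\|y\| \to 0} \frac{\|(I - x^{-1}y + (x^{-1}y)^2 - \cdots)x^{-1} - x^{-1} - F'(x)y\|}{\|y\|} \\
&\quad = \lim_{\|y\| \to 0} \frac{\|(-x^{-1}y + (x^{-1}y)^2 - \cdots)x^{-1} - F'(x)y\|}{\|y\|} \\
&\quad = \lim_{\|y\| \to 0} \frac{\|-x^{-1}yx^{-1} + (x^{-1}y)^2x^{-1} - \cdots - F'(x)y\|}{\|y\|} \\
&\quad = 0 \quad \text{provided } F'(x)y = -x^{-1}yx^{-1}.
\end{aligned}$$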


Thus, $F'(x)y = -x^{-1}yx^{-1}$. But we need $(F'(x))^{-1}$ for Newton’s method. We
have $y = -(F'(x))^{-1}x^{-1}yx^{-1}$, from which we obtain $(F'(x))^{-1}k = -xkx$,
where $k = x^{-1}yx^{-1}$ and $y = xkx$. (Here $k$ denotes a matrix, not the iteration
index.) Therefore, Newton’s method has the following form for this problem:

$$\begin{aligned}
x^{(k+1)} &= x^{(k)} - (F'(x^{(k)}))^{-1}F(x^{(k)}) \\
&= x^{(k)} + x^{(k)}F(x^{(k)})x^{(k)} \\
&= x^{(k)} + x^{(k)}\left((x^{(k)})^{-1} - A\right)x^{(k)} = x^{(k)} + x^{(k)}(I - Ax^{(k)}),
\end{aligned}$$

or

$$x^{(k+1)} = 2x^{(k)} - x^{(k)}Ax^{(k)} \quad \text{for } k = 0, 1, 2, \ldots, \tag{8.37}$$

and $x^{(k)} \to A^{-1}$ as $k \to \infty$. For this method, $x^{(k)}$ converges quadratically to
$A^{-1}$. Notice that method (8.37) corresponds to method (3.82) described in
Section 3.4.11 (on page 167).
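
Iteration (8.37) uses only matrix products, so it is easy to try out. The sketch below is illustrative only: the $2 \times 2$ matrix `A`, the starting guess `X0`, and the helper names are assumptions of ours. The starting guess is chosen so that $\rho(I - AX_0) < 1$, which is the standard condition under which this iteration converges; in this sketch the residual satisfies $I - AX^{(k+1)} = (I - AX^{(k)})^2$, which is one way to see the quadratic convergence.

```python
def matmul(P, Q):
    """Product of two square matrices stored as lists of rows."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def newton_inverse(A, X0, steps):
    """Iterate X <- 2X - X A X, i.e. iteration (8.37)."""
    n = len(A)
    X = [row[:] for row in X0]
    for _ in range(steps):
        XAX = matmul(matmul(X, A), X)
        X = [[2.0 * X[i][j] - XAX[i][j] for j in range(n)] for i in range(n)]
    return X

# Hypothetical test data (assumptions, not from the text).
A = [[4.0, 1.0], [2.0, 3.0]]
X0 = [[0.2, 0.0], [0.0, 0.2]]        # here rho(I - A X0) = 0.6 < 1
X = newton_inverse(A, X0, steps=8)

# Largest entry of the residual I - A X.
AX = matmul(A, X)
residual = max(abs((1.0 if i == j else 0.0) - AX[i][j])
               for i in range(2) for j in range(2))
```

For this `A`, the exact inverse is $\frac{1}{10}\begin{pmatrix} 3 & -1 \\ -2 & 4 \end{pmatrix}$, which gives a simple check on `X`.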


8.3.4 Local, Semilocal, and Global Convergence 


In Section 8.3.1, we proved a local convergence result for Newton’s method
(Theorem 8.6 on page 450). In the next two sections, we will consider two
other well-known results for Newton’s method. First, it is useful to understand
that in a local convergence result, a solution $x^*$ is assumed to exist, and it is
shown that there is a neighborhood about $x^*$ such that the iterates converge
to $x^*$ from any starting point in that neighborhood. In a semilocal convergence
result, it is shown that, for a particular choice of initial values, the iterates
converge to a solution $x^*$ (whose existence is not assumed in advance). Finally,
in a global convergence result, it is shown that, for initial values on a large
subset,² there is convergence to a solution $x^*$.


8.3.5 The Newton—Kantorovich Theorem 


The following semilocal convergence result is perhaps the most widely- 
recognized convergence theorem associated with Newton’s method. 


THEOREM 8.7 
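(Newton–Kantorovich) Let $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ be F-differentiable on a convex
set $D_0 \subset D$, let $\|\cdot\|$ be some norm, and assume that, for some constant $\gamma > 0$,

$$\|F'(x) - F'(y)\| \le \gamma\|x - y\| \quad \text{for all } x, y \in D_0. \tag{8.38}$$

(That is, assume that $F'$ is Lipschitz continuous on $D_0$.) Suppose that $F'(x^{(0)})$
is invertible for some $x^{(0)} \in D_0$. Moreover, suppose that, for constants $\beta, \eta > 0$,

$$\|(F'(x^{(0)}))^{-1}\| \le \beta, \tag{8.39}$$

$$\|(F'(x^{(0)}))^{-1}F(x^{(0)})\| \le \eta. \tag{8.40}$$

Also assume that

$$\alpha = \beta\gamma\eta \le \frac{1}{2}. \tag{8.41}$$

Set

$$t^* = \frac{1}{\beta\gamma}\left(1 - (1 - 2\alpha)^{1/2}\right) \tag{8.42}$$

and assume that

$$\bar{S}(x^{(0)}, t^*) = \{x : \|x - x^{(0)}\| \le t^*\} \subset D_0.$$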


² Or in the entire domain of $F$.
Then, the Newton iterates

$$x^{(k+1)} = x^{(k)} - (F'(x^{(k)}))^{-1}F(x^{(k)}), \quad k = 0, 1, 2, \ldots,$$

are well-defined, remain in $\bar{S}(x^{(0)}, t^*)$, and converge to a solution $x^*$ of
$F(x) = 0$ which is unique in $\bar{S}(x^{(0)}, t^*)$.
Moreover, we have the error estimate

$$\|x^* - x^{(k)}\| \le \frac{(2\alpha)^{2^k}}{\beta\gamma\,2^k}, \quad k = 0, 1, 2, \ldots. \tag{8.43}$$


PROOF See [70]. 
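
To see how the hypotheses are checked in practice, the following sketch evaluates the Kantorovich constants for a scalar problem. Everything specific here is an assumption of ours: the problem $F(x) = x^2 - 2$ with $x^{(0)} = 1.5$ and $D_0 = [1, 2]$, the function name `kantorovich_constants`, and the a priori bound $(2\alpha)^{2^k} / (\beta\gamma\,2^k)$, which is the form in which the estimate is usually quoted for this theorem (see [70]).

```python
import math

def kantorovich_constants(beta, gamma, eta):
    """Return alpha = beta*gamma*eta and the radius t* from the theorem."""
    alpha = beta * gamma * eta
    if alpha > 0.5:
        raise ValueError("alpha exceeds 1/2; the theorem does not apply")
    t_star = (1.0 - math.sqrt(1.0 - 2.0 * alpha)) / (beta * gamma)
    return alpha, t_star

# Illustrative scalar problem (an assumption): F(x) = x^2 - 2, x0 = 1.5.
x0 = 1.5
gamma = 2.0                                  # |F'(x) - F'(y)| = 2|x - y|
beta = 1.0 / abs(2.0 * x0)                   # bound on |F'(x0)^{-1}|
eta = abs((x0 * x0 - 2.0) / (2.0 * x0))      # bound on |F'(x0)^{-1} F(x0)|
alpha, t_star = kantorovich_constants(beta, gamma, eta)

# Compare actual Newton errors with the assumed a priori bound.
x, root = x0, math.sqrt(2.0)
actual, bound = [], []
for k in range(4):
    actual.append(abs(x - root))
    bound.append((2.0 * alpha) ** (2 ** k) / (beta * gamma * 2 ** k))
    x = x - (x * x - 2.0) / (2.0 * x)        # Newton step
```

In this example $t^*$ coincides with the initial error $1.5 - \sqrt{2}$, so the ball $\bar{S}(x^{(0)}, t^*)$ just reaches the solution; the comparison is limited to the first few iterates because later errors fall to roundoff level.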


8.3.6 A Global Convergence Result for Newton’s Method 


Global convergence of Newton’s method (existence of a unique solution to 
which Newton’s method converges from any starting point) occurs only under 
special circumstances. However, these circumstances occur commonly enough 
in practice to justify studying when such global convergence occurs. 

Various possible sets of assumptions lead to global convergence of New- 
ton’s method. We now present an example of one possible global convergence 
theorem. We first need a definition and a lemma. 

In the remainder of this section, inequalities between vectors are interpreted
componentwise; for example, $v \ge w$ means $v_i \ge w_i$, $1 \le i \le n$. Similarly,
matrix inequalities are interpreted componentwise; for example, if $A$ is an $n$
by $n$ matrix, then $A \ge 0$ means $a_{ij} \ge 0$ for $1 \le i \le n$ and $1 \le j \le n$.


DEFINITION 8.7 $F$ is convex on a convex set $D_0$ if

$$\lambda F(x) + (1 - \lambda)F(y) \ge F(\lambda x + (1 - \lambda)y)$$

for all $\lambda \in [0, 1]$ and $x, y \in D_0$.


We can now present 


LEMMA 8.4 
Let $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$ be F-differentiable on the convex set $D_0 \subset D$. Then $F$
is convex on $D_0$ if and only if

$$F(y) - F(x) \ge F'(x)(y - x) \quad \text{for } x, y \in D_0. \tag{8.44}$$


PROOF First suppose that (8.44) holds. Fix $x, y \in D_0$ and $\lambda \in [0, 1]$,
and set $z = \lambda x + (1 - \lambda)y$. Since $D_0$ is convex, $z \in D_0$, and (8.44) implies

(i) $F(x) - F(z) \ge F'(z)(x - z)$ and

(ii) $F(y) - F(z) \ge F'(z)(y - z)$.

Multiplying (i) by $\lambda$ and (ii) by $(1 - \lambda)$ and adding, we obtain

$$\begin{aligned}
\lambda F(x) + (1 - \lambda)F(y) - F(z) &\ge \lambda F'(z)(x - z) + (1 - \lambda)F'(z)(y - z) \\
&= F'(z)[\lambda x + (1 - \lambda)y - z] = 0.
\end{aligned}$$

Hence,

$$\lambda F(x) + (1 - \lambda)F(y) \ge F(z) = F(\lambda x + (1 - \lambda)y),$$

which shows that $F$ is convex.
Now suppose that $F$ is convex on $D_0$, and let $0 < \lambda \le 1$. Then we can write

$$F(\lambda y + (1 - \lambda)x) \le \lambda F(y) + (1 - \lambda)F(x)$$

in the form

$$\frac{1}{\lambda}\left(F(x + \lambda(y - x)) - F(x)\right) \le F(y) - F(x). \tag{8.45}$$

By F-differentiability of $F$, it follows that the left side tends to $F'(x)(y - x)$
as $\lambda \to 0$. Also, if $a(\lambda), b \in \mathbb{R}^n$ are such that $a(\lambda) \le b$ for all $\lambda \ne 0$ and
$\lim_{\lambda \to 0} a(\lambda) = a$, it follows that $a \le b$ (componentwise). Thus, (8.45) implies
(8.44).


We can now state and prove an example of a global convergence result for 
Newton’s method. 


THEOREM 8.8 

Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be continuously Fréchet-differentiable and convex over
all of $\mathbb{R}^n$. In addition, suppose that $(F'(x))^{-1}$ exists for all $x \in \mathbb{R}^n$ and
$(F'(x))^{-1} \ge 0$ for all $x \in \mathbb{R}^n$. Let $F(x) = 0$ have a solution $x^*$. Then $x^*$ is
unique, and the Newton iterates $x^{(k+1)} = x^{(k)} - (F'(x^{(k)}))^{-1}F(x^{(k)})$ converge
to $x^*$ for any initial choice $x^{(0)} \in \mathbb{R}^n$. Moreover,

$$x^* \le x^{(k+1)} \le x^{(k)} \quad \text{for } k = 1, 2, \ldots. \tag{8.46}$$

(Throughout this theorem statement, the inequalities are interpreted
componentwise.)


PROOF First, we show by induction that (8.46) holds. Let $x^{(0)} \in \mathbb{R}^n$ be
arbitrary. By hypothesis, all the Newton iterates are well-defined, and

$$x^{(1)} = x^{(0)} - (F'(x^{(0)}))^{-1}F(x^{(0)}).$$

Then, by Lemma 8.4,

$$F(x^{(1)}) - F(x^{(0)}) \ge F'(x^{(0)})(x^{(1)} - x^{(0)}) = -F(x^{(0)}),$$

so $F(x^{(1)}) \ge 0$. But again by Lemma 8.4,

$$0 = F(x^*) \ge F(x^{(1)}) + F'(x^{(1)})(x^* - x^{(1)}).$$

It then follows from $(F'(x^{(1)}))^{-1} \ge 0$ that

$$0 \ge (F'(x^{(1)}))^{-1}F(x^{(1)}) + (x^* - x^{(1)}).$$

Since $(F'(x^{(1)}))^{-1}F(x^{(1)}) \ge 0$, we conclude that $x^* \le x^{(1)}$. For general $k$, it
follows exactly as above, and we obtain

$$x^* \le x^{(k)} \quad \text{and} \quad F(x^{(k)}) \ge 0 \quad \text{for } k = 1, 2, \ldots.$$

But

$$x^{(k+1)} = x^{(k)} - (F'(x^{(k)}))^{-1}F(x^{(k)}) \le x^{(k)}.$$

Therefore, (8.46) holds.

Now, we need to show that $x^{(k)} \to x^*$ as $k \to \infty$. First, note that $x_i^{(k)}$
is monotonically decreasing and bounded below by $x_i^*$ for each $i$, $1 \le i \le n$.
Thus, $x^{(k)} \to y = (y_1, y_2, \ldots, y_n)^T$ as $k \to \infty$. Furthermore, since $F'(x)$ is a
continuous function of $x$,

$$\begin{aligned}
F(y) = \lim_{k \to \infty} F(x^{(k)}) &= -\lim_{k \to \infty} F'(x^{(k)})(x^{(k+1)} - x^{(k)}) \quad \text{(Newton iterates)} \\
&= -F'(y)\,0 = 0.
\end{aligned}$$

Hence, $y$ is a solution of $F(x) = 0$. But $F(x) = 0$ has only one solution $x^*$.
To see this, let $x^*$ and $y^*$ be two solutions; then, by Lemma 8.4,

$$0 = F(y^*) - F(x^*) \ge F'(x^*)(y^* - x^*),$$

and multiplying both sides by $(F'(x^*))^{-1} \ge 0$ gives $y^* \le x^*$. Reversing the
roles of $x^*$ and $y^*$, we obtain $x^* \le y^*$ and thus $x^* = y^*$.


Example 8.3 

(One Dimension) Let $F(x) = x + e^x + a$ for some $a \in \mathbb{R}$. Then $F$ is convex on
all of $\mathbb{R}$. To see this, one can see, e.g. from Taylor’s Theorem, that $e^z \ge 1 + z$
for all $z \in \mathbb{R}$. Thus, $e^{y - x} - 1 \ge y - x$, so $e^y - e^x \ge e^x(y - x)$. Therefore,
$y + e^y + a - x - e^x - a \ge (1 + e^x)(y - x)$, and we have $F(y) - F(x) \ge F'(x)(y - x)$
for $x, y \in \mathbb{R}$. Now, consider $F'(x) = 1 + e^x$. We have

$$(F'(x))^{-1} = \frac{1}{1 + e^x} > 0 \quad \text{for all } x \in \mathbb{R}.$$

Finally, the Intermediate Value Theorem implies that there is an $x^* \in \mathbb{R}$ such
that $F(x^*) = 0$. Hence, Theorem 8.8 can be applied, namely, the iterates
defined by

$$x^{(k+1)} = x^{(k)} - \left(\frac{1}{1 + e^{x^{(k)}}}\right)\left(x^{(k)} + e^{x^{(k)}} + a\right)$$

converge to $x^*$.
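
The behavior predicted by Theorem 8.8 is easy to observe for this example. In the sketch below, the value $a = -1$ (for which $x^* = 0$), the starting point, the tolerance, and the function name are arbitrary choices of ours. Starting to the left of the root, the first step overshoots to the right of $x^*$, after which the iterates decrease toward it.

```python
import math

def newton_example_8_3(a, x0, tol=1e-14, max_iter=100):
    """Newton iterates for F(x) = x + exp(x) + a, returned as a list."""
    xs = [x0]
    for _ in range(max_iter):
        x = xs[-1]
        step = (x + math.exp(x) + a) / (1.0 + math.exp(x))
        xs.append(x - step)
        if abs(step) < tol:
            break
    return xs

# a = -1 gives F(x) = x + exp(x) - 1, whose root is x* = 0.
xs = newton_example_8_3(a=-1.0, x0=-5.0)
```

In floating-point arithmetic the monotonicity can only be expected to hold up to roundoff once the iterates are within machine precision of $x^*$.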
8.3.7 Practical Considerations


The following algorithm combines Newton’s method with a few practical 
considerations. 


ALGORITHM 8.1 
(Newton’s method) 
INPUT:

(a) an initial guess $x^{(0)}$;
(b) a maximum number of iterations $M$;
(c) a domain stopping tolerance $\epsilon_d$ and a range stopping tolerance $\epsilon_r$.

OUTPUT: either “success” or “failure.” If “success,” then also output the
number of iterations $k$ and the approximation $x^{(k+1)}$ to the solution $x^*$.

1. “success” ← “false”.
2. FOR $k = 0$ to $M$.
   (a) Evaluate $F'(x^{(k)})$. (That is, evaluate the corresponding $n^2$ partial
       derivatives at $x^{(k)}$.)
   (b) Solve $F'(x^{(k)})v^{(k)} = -F(x^{(k)})$ for $v^{(k)}$.
       ◦ IF $F'(x^{(k)})v^{(k)} = -F(x^{(k)})$ cannot be solved (such as when
         $F'(x^{(k)})$ is numerically singular) THEN EXIT.
   (c) $x^{(k+1)}$ ← $x^{(k)} + v^{(k)}$.
   (d) IF ($\|v^{(k)}\| < \epsilon_d$ or $\|F(x^{(k+1)})\| < \epsilon_r$) THEN
       i. “success” ← “true”.
       ii. EXIT.
   END FOR

END ALGORITHM 8.1.
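
A minimal Python sketch of Algorithm 8.1 follows. Several details are our own illustrative choices rather than part of the algorithm as stated: the helper `solve_linear` (Gaussian elimination with partial pivoting), the absolute pivot threshold `1e-14` used to declare numerical singularity, the use of the infinity norm, and the small test system.

```python
import math

def solve_linear(J, rhs):
    """Solve J v = rhs by Gaussian elimination with partial pivoting.
    Returns None if a pivot is numerically zero (assumed threshold)."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(J)]   # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-14:
            return None
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for i in reversed(range(n)):                          # back substitution
        s = sum(M[i][j] * v[j] for j in range(i + 1, n))
        v[i] = (M[i][n] - s) / M[i][i]
    return v

def norm_inf(u):
    return max(abs(t) for t in u)

def newton(F, Fprime, x0, M, eps_d, eps_r):
    """Return (success, k, x) following the steps of Algorithm 8.1."""
    x = list(x0)
    for k in range(M + 1):
        v = solve_linear(Fprime(x), [-t for t in F(x)])   # steps (a), (b)
        if v is None:                                     # cannot be solved
            return False, k, x
        x = [xi + vi for xi, vi in zip(x, v)]             # step (c)
        if norm_inf(v) < eps_d or norm_inf(F(x)) < eps_r: # step (d)
            return True, k, x
    return False, M, x

# Hypothetical test system: x0^2 + x1^2 = 4, x0 = x1, with root (sqrt 2, sqrt 2).
def F_demo(x):
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]]

def Fprime_demo(x):
    return [[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]]

success, k, x = newton(F_demo, Fprime_demo, [1.0, 2.0], M=50,
                       eps_d=1e-12, eps_r=1e-12)
```

Returning a flag rather than raising an exception mirrors the “success”/“failure” output of the algorithm.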


8.3.7.1 Advantages of Newton’s Method 


(a) If $F'(x^*)$ is nonsingular, then a domain of attraction exists. (Thus, if a
Newton iterate lands in the attraction ball, the successive iterates will
remain in the ball and eventually converge to $x^*$.)

(b) The convergence is generally superlinear, and if $F'$ satisfies a Lipschitz
condition at $x^*$, the convergence is quadratic. (Recall that, in quadratic
convergence, the number of significant digits in $x^{(k)}$ as an approximation
to $x^*$ is roughly doubled each iteration.)

(c) Newton’s method is “self-correcting” in the sense that $x^{(k+1)}$ only
depends on $F$ and $x^{(k)}$, so bad effects (such as the effects of roundoff error)
from previous iterations are not propagated.³


8.3.7.2 Disadvantages of Newton’s Method 


(a) The attraction ball may be very small, so a good initial approximation
to $x^*$ may be required.

(b) We need to solve a linear system of size $n$ at each step, which requires
$O(n^3)$ work for a dense system.

(c) The Jacobian matrix is required at each step, which requires evaluation
of $n^2$ scalar functions $\partial f_i / \partial x_j$ for a dense system.⁴


8.3.7.3 Modifications to Newton’s Method 


How can we make the procedure faster or more computationally convenient? 
Some possibilities are: 


1. Approximate the partial derivatives in $F'(x^{(k)})$, e.g.

$$\frac{\partial f_i(x^{(k)})}{\partial x_j} \approx \frac{f_i(x_1^{(k)}, \ldots, x_{j-1}^{(k)}, x_j^{(k)} + h, x_{j+1}^{(k)}, \ldots, x_n^{(k)}) - f_i(x^{(k)})}{h},$$

where $h$ is small. However, it can be proved, under certain conditions
on $F$, that the method is only linearly convergent.⁵ Nonetheless, this
method may be reasonable for black box systems, that is, systems for
which only the values of $f$ can be obtained (and not the equations, or
the computer program that evaluates $f$).

2. Use automatic (algorithmic) differentiation (as we introduced in
Section 6.2, starting on page 327) to compute $F'$.

3. Solve $F'(x^{(k)})v^{(k)} = -F(x^{(k)})$ approximately.
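
As an illustration of the first possibility, a forward-difference approximation to the Jacobian matrix might be coded as follows. This is a sketch only: the function name `fd_jacobian`, the default step `h = 1e-7`, and the test function are our assumptions, and no attempt is made to choose $h$ adaptively.

```python
def fd_jacobian(F, x, h=1e-7):
    """Forward-difference approximation to the Jacobian F'(x).
    Uses n + 1 evaluations of F for an n-vector x."""
    n = len(x)
    Fx = F(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xh = list(x)
        xh[j] += h                       # perturb only the j-th coordinate
        Fh = F(xh)
        for i in range(n):
            J[i][j] = (Fh[i] - Fx[i]) / h
    return J

# Hypothetical test function; its exact Jacobian is [[2*x0, 2*x1], [x1, x0]].
def F_demo(x):
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] * x[1] - 1.0]

J = fd_jacobian(F_demo, [1.0, 2.0])
```

The result agrees with the exact Jacobian only to roughly the size of $h$, which is consistent with the loss of the quadratic rate mentioned in item 1.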


As associated software improves, algorithmic differentiation (item 2) is
increasingly replacing finite-difference approximations (item 1) in Newton’s
method. Not only is quadratic convergence retained, but considerably fewer
than $n^2$ operations are required to compute the Jacobian matrix when the
system is structured (i.e., when the Jacobian matrix has a sparse structure).

³ This is in contrast to, say, an unstable method for solving an initial value problem.

⁴ However, these partial derivatives do not necessarily need to be programmed by hand;
moreover, automatic differentiation techniques can take advantage of the structure of the
system to reduce the total amount of computation required.

⁵ Finite differences have been used in the past to remove the need to manually compute and
program partial derivatives. However, this reason for using finite differences has disappeared
for many applications, for which automatic differentiation or derivatives produced with
computer algebra systems can now be used.

Approximately solving the system (item 3), especially with iterative methods,
remains a popular technique, particularly for very large, structured systems
of equations. In the remainder of this section, we consider this in more detail.
First, consider iterative solution of the system $Ax = b$. Recall that the SOR
method can be written in the form

$$x^{(m+1)} = x^{(m)} - \sigma(D - \sigma L)^{-1}(Ax^{(m)} - b) \quad \text{for } m = 0, 1, 2, \ldots \tag{8.47}$$

with $x^{(0)}$ given, where

$$A = D - L - U, \tag{8.48}$$

with $D$ the diagonal part of $A$ and $-L$ and $-U$ its strictly lower and strictly
upper triangular parts. Recall that when $\sigma = 1$, we obtain the Gauss–Seidel
iterative method.

One way to use SOR for nonlinear problems is by approximately solving 
the linear system present at each step of Newton’s method. In this case, the 
primary iteration is Newton’s method and the secondary is SOR. We call this 
the Newton-SOR method, which we now describe. 

In Newton’s method, $x^{(k+1)} = x^{(k)} - (F'(x^{(k)}))^{-1}F(x^{(k)})$ for $k = 0, 1, 2, \ldots$.
This can be written as

$$F'(x^{(k)})x^{(k+1)} = F'(x^{(k)})x^{(k)} - F(x^{(k)}). \tag{8.49}$$

We solve (8.49) approximately using SOR. To do this we decompose the
Jacobian matrix $F'(x^{(k)})$ as

$$F'(x^{(k)}) = D_k - L_k - U_k, \tag{8.50}$$

and specify some relaxation parameter $\sigma_k$, $0 < \sigma_k < 2$. To apply SOR to
(8.49), we denote the $m$-th SOR iterate by $x^{k,m}$, $m = 0, 1, 2, \ldots$, and apply
(8.47) with $A = F'(x^{(k)})$, $b_k = F'(x^{(k)})x^{(k)} - F(x^{(k)})$, to obtain

$$x^{k,m} = x^{k,m-1} - \sigma_k(D_k - \sigma_k L_k)^{-1}\left(F'(x^{(k)})x^{k,m-1} - b_k\right) \tag{8.51}$$

for $m = 1, 2, 3, \ldots$. A natural choice for $x^{k,0}$ is to let it equal $x^{(k)}$ of the previous
Newton iterate. Finally, we assume the SOR iterations are terminated after
$m_k$ iterations, and set $x^{(k+1)} = x^{k,m_k}$.

How is $m_k$ chosen? We could choose $m_k$ by terminating the SOR iterations
by a convergence criterion such as $\|x^{k,m} - x^{k,m-1}\| < \epsilon$ for some specified $\epsilon$;
then, $m_k$ varies with $k$. We could also specify $m_k$ in advance. The simplest
choice is $m_k = 1$, which leads to the 1-step Newton–SOR iteration:

$$x^{(k+1)} = x^{(k)} - \sigma_k(D_k - \sigma_k L_k)^{-1}F(x^{(k)}), \quad k = 0, 1, \ldots. \tag{8.52}$$

Furthermore, if $\sigma_k = 1$, method (8.51) is called the Newton–Gauss–Seidel
Method. Note that the 1-step Newton–Gauss–Seidel method only requires
evaluation of $D_k - L_k$, i.e., the evaluation of $\frac{n(n+1)}{2}$ partial derivatives
at $x^{(k)}$, while the $m$-step Newton–SOR method ($m > 1$) requires $n^2$ partial
derivatives at $x^{(k)}$, assuming the system is dense.
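
The 1-step Newton–SOR iteration (8.52) amounts to one forward substitution per step. The sketch below is illustrative: the function names and the mildly nonlinear tridiagonal test system are assumptions of ours, chosen so that the iteration can be expected to converge. Note that only the diagonal and the strictly lower triangle of the Jacobian matrix are referenced.

```python
def newton_sor_step(F, jac, x, sigma):
    """One step of (8.52): x <- x - sigma * (D - sigma*L)^{-1} F(x),
    where F'(x) = D - L - U."""
    n = len(x)
    J = jac(x)
    Fx = F(x)
    w = [0.0] * n
    # Forward substitution for (D - sigma*L) w = F(x).  Since -L is the
    # strictly lower triangle of J, row i reads
    #   J[i][i]*w[i] + sigma * sum_{j<i} J[i][j]*w[j] = Fx[i].
    for i in range(n):
        s = sum(J[i][j] * w[j] for j in range(i))
        w[i] = (Fx[i] - sigma * s) / J[i][i]
    return [xi - sigma * wi for xi, wi in zip(x, w)]

# Hypothetical test system: 3*x_i - x_{i-1} - x_{i+1} + 0.1*x_i^3 - 1 = 0.
def F_demo(x):
    n = len(x)
    out = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        out.append(3.0 * x[i] - left - right + 0.1 * x[i] ** 3 - 1.0)
    return out

def jac_demo(x):
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        J[i][i] = 3.0 + 0.3 * x[i] ** 2
        if i > 0:
            J[i][i - 1] = -1.0
        if i < n - 1:
            J[i][i + 1] = -1.0
    return J

x = [0.0, 0.0, 0.0]
for _ in range(60):
    x = newton_sor_step(F_demo, jac_demo, x, sigma=1.0)   # Newton-Gauss-Seidel
residual = max(abs(t) for t in F_demo(x))
```

With `sigma=1.0` this is the 1-step Newton–Gauss–Seidel method; the fixed count of 60 steps is arbitrary and stands in for a proper stopping test.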

There is a second way of extending an iterative method for linear systems to
nonlinear systems. Consider, first, the Jacobi method, perhaps the simplest
iterative scheme for linear systems. Recall that for $Ax = b$, the Jacobi method
consists of

$$x_i^{(k+1)} = \frac{1}{a_{ii}}\Bigl(b_i - \sum_{j \ne i} a_{ij}x_j^{(k)}\Bigr), \quad i = 1, 2, \ldots, n \quad \text{for } k = 0, 1, 2, \ldots, \tag{8.53}$$

with $x^{(0)}$ given.

Equation (8.53) can be interpreted as solving approximately the $i$-th equation
of the system $Ax = b$ for unknown $x_i$, holding fixed all other unknowns
$x_j$, $j \ne i$, at the $k$-th level, i.e., at $x_j^{(k)}$.

Consider now the nonlinear map $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$, where we seek $x^*$ such
that $F(x^*) = 0$ and $F(x) = (f_1(x), \ldots, f_n(x))^T$. The analog of the linear
Jacobi method is the nonlinear Jacobi method, in which the unknowns $x_j$, $j \ne i$,
are kept fixed at the level $x^{(k)}$ and the $i$-th equation of $F(x) = 0$ is solved for $x_i$.
That is, we solve

$$f_i(x_1^{(k)}, \ldots, x_{i-1}^{(k)}, x_i, x_{i+1}^{(k)}, \ldots, x_n^{(k)}) = 0 \tag{8.54}$$

for $x_i$ for $1 \le i \le n$ (substituting the current values $x_j^{(k)}$, $j \ne i$). We call the
resulting vector $x = (x_1, x_2, \ldots, x_n)^T$, thus defining the next iterate:

$$x^{(k+1)} = (x_1^{(k+1)}, x_2^{(k+1)}, \ldots, x_n^{(k+1)})^T = (x_1, x_2, \ldots, x_n)^T.$$

For each $i$, (8.54) is just a single nonlinear equation with one unknown $x_i$,
and can be solved, for example, by applying $m_k$ steps of Newton’s method
(applied to the scalar nonlinear equation). Thus, the $j$-th Newton step has
the form:

$$u^{(j)} = u^{(j-1)} - \frac{f_i(x_1^{(k)}, \ldots, x_{i-1}^{(k)}, u^{(j-1)}, x_{i+1}^{(k)}, \ldots, x_n^{(k)})}{\dfrac{\partial f_i}{\partial x_i}(x_1^{(k)}, \ldots, x_{i-1}^{(k)}, u^{(j-1)}, x_{i+1}^{(k)}, \ldots, x_n^{(k)})}, \tag{8.55}$$

where $j = 1, 2, \ldots, m_k$, and $u^{(0)}$ is taken as $x_i^{(k)}$. We are thus led to the
Jacobi–Newton Method.

For example, the 1-step Jacobi–Newton Method involves applying, for each
$i$, $1 \le i \le n$, one step of Newton’s method for approximately solving (8.54)
for $x_i$.

In an analogous manner, we can define Gauss–Seidel–Newton or SOR–Newton
methods. For example, in Gauss–Seidel–Newton methods, for the
$i$-th equation, we solve

$$f_i(\underbrace{x_1^{(k+1)}, \ldots, x_{i-1}^{(k+1)}}_{\text{Note: } k+1},\; x_i,\; \underbrace{x_{i+1}^{(k)}, \ldots, x_n^{(k)}}_{\text{Note: } k}) = 0 \tag{8.56}$$

for $x_i$ by applying $m_k$ steps of Newton’s method, and we call the result $x_i^{(k+1)}$.
More generally, if after finding $x_i$ from (8.56), we set

$$x_i^{(k+1)} = x_i^{(k)} + \sigma_k(x_i - x_i^{(k)}) \tag{8.57}$$

for some parameter $\sigma_k$, we obtain the SOR–Newton Method. The one-step
SOR–Newton method has the form

$$x_i^{(k+1)} = x_i^{(k)} - \sigma_k\left(\frac{\partial f_i(x^{k,i})}{\partial x_i}\right)^{-1} f_i(x^{k,i}), \tag{8.58}$$

where $x^{k,i} = (x_1^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, x_i^{(k)}, \ldots, x_n^{(k)})^T$. (See Exercise 11
below.)

Hence, the 1-step SOR–Newton method requires, at each step $k$, the
evaluation of the $n$ functions $f_i(x^{k,i})$ as well as the $n$ derivatives $\partial f_i(x^{k,i})/\partial x_i$.
(Contrast this with the number of component function evaluations required
for the Newton–SOR method.)
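
One sweep of the one-step SOR–Newton method (8.58) can be sketched as follows. The calling convention (a function `f(i, x)` returning the $i$-th component and a function `dfdx(i, x)` returning $\partial f_i / \partial x_i$) and the test system are illustrative assumptions of ours. Because the vector is overwritten in place, at step $i$ it holds exactly the mixed point $x^{k,i}$ of (8.58).

```python
def sor_newton_sweep(f, dfdx, x, sigma):
    """One sweep of (8.58): for i = 1, ..., n in order,
    x_i <- x_i - sigma * f_i(x^{k,i}) / (d f_i / d x_i)(x^{k,i})."""
    x = list(x)                      # updated in place: at step i this is x^{k,i}
    for i in range(len(x)):
        x[i] -= sigma * f(i, x) / dfdx(i, x)
    return x

# Hypothetical test system: 3*x_i - x_{i-1} - x_{i+1} + 0.1*x_i^3 - 1 = 0.
def f_demo(i, x):
    left = x[i - 1] if i > 0 else 0.0
    right = x[i + 1] if i < len(x) - 1 else 0.0
    return 3.0 * x[i] - left - right + 0.1 * x[i] ** 3 - 1.0

def dfdx_demo(i, x):
    return 3.0 + 0.3 * x[i] ** 2

x = [0.0, 0.0, 0.0]
for _ in range(60):
    x = sor_newton_sweep(f_demo, dfdx_demo, x, sigma=1.0)
residual = max(abs(f_demo(i, x)) for i in range(len(x)))
```

Each sweep uses exactly $n$ component evaluations and $n$ diagonal derivative evaluations, in line with the operation count stated above.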


REMARK 8.7 _ Note the difference between e.g. the Newton—Gauss—Seidel 
Method, in which we solve the linear system arising from the multivariate 
Newton method with Gauss-Seidel iteration, and the Gauss—Seidel-Newton 
Method, in which, in principle we solve the i-th nonlinear equation for the 
i-th variable without first replacing it by a linear approximation. 


REMARK 8.8 For general matrices, iterative methods for solving linear
systems of equations need to be preconditioned first. However, certain
applications occurring in practice do not require preconditioning. For example, the
linear systems arising from discretization of the heat equation do not require
preconditioning for the Gauss–Seidel (and Jacobi and SOR) methods to
converge. If “mild” nonlinearities (that is, nonlinearities that are not large in relation
to the other terms in the equations) are introduced into such systems, we
obtain nonlinear systems for which the Gauss–Seidel–Newton method will
converge. Convergence, even local convergence, cannot be expected for general
nonlinear systems when the Jacobi–Newton method, Gauss–Seidel–Newton
method, or SOR–Newton method is used.


REMARK 8.9 These composite methods, e.g. SOR–Newton, do not
possess the superlinear convergence rate of Newton’s method. In fact, they
converge linearly, if they converge. However, other convergence results (e.g.
local, global, etc.) for these composite methods that are similar to Newton’s
method can be obtained under certain assumptions.


8.4 Multivariate Interval Newton Methods 


Multivariate interval Newton methods are similar to univariate interval 
Newton methods (as presented in Section 2.5, starting on page 54), in the 
sense that they provide rigorous bounds on solutions, in addition to existence 
and uniqueness proofs [44, 62]. Because of this, multivariate interval Newton 
methods have a good potential for computing mathematically rigorous bounds 
on a solution to a nonlinear system of equations, given an approximate solu- 
tion (computed, say, by a point Newton method). Interval Newton methods 
are also used as parts of more involved algorithms to find all solutions to a 
nonlinear system, or for global optimization. (See the section on branch and 
bound algorithms, on page 523 below.) 

Most multivariate interval Newton methods follow a form similar to that
of the multivariate point method seen in Formula (8.22). To explain this, we
introduce two preliminary definitions.


DEFINITION 8.8 Suppose $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$, and suppose $\boldsymbol{x} \subseteq D$ is
an interval $n$-vector (i.e., a “box”). Then an interval matrix $\boldsymbol{A}$ is said to be
a Lipschitz matrix for $F$ over $\boldsymbol{x}$ if and only if, for each $x \in \boldsymbol{x}$ and $y \in \boldsymbol{x}$,
there is an $A \in \boldsymbol{A}$ such that

$$F(y) - F(x) = A(y - x).$$


You will show in Exercise 16 (on page 484 below) that matrices formed
from interval extensions of the partial derivatives are Lipschitz matrices.


Example 8.4 

Suppose $F(x) = (f_1(x_1, x_2), f_2(x_1, x_2))^T$, where

$$f_1(x_1, x_2) = x_1^2 - x_2^2 - 1, \qquad f_2(x_1, x_2) = 2x_1x_2.$$

Then the Jacobian matrix is

$$F'(x) = \begin{pmatrix} 2x_1 & -2x_2 \\ 2x_2 & 2x_1 \end{pmatrix},$$

and a Lipschitz matrix for $F$ over the box $\boldsymbol{x} = ([-0.1, 0.1], [0.9, 1.1])^T$ is

$$\boldsymbol{F}'(\boldsymbol{x}) = \begin{pmatrix} 2[-0.1, 0.1] & -2[0.9, 1.1] \\ 2[0.9, 1.1] & 2[-0.1, 0.1] \end{pmatrix}
= \begin{pmatrix} [-0.2, 0.2] & [-2.2, -1.8] \\ [1.8, 2.2] & [-0.2, 0.2] \end{pmatrix}.$$

(We use $\boldsymbol{F}'(\boldsymbol{x})$ to denote an elementwise interval evaluation of the Jacobian
matrix for $F$.)
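
The elementwise interval evaluation in this example can be reproduced with a few lines of code. The sketch below is deliberately naive and rests on assumptions of ours: intervals are stored as `(lo, hi)` tuples, the helper names are hypothetical, and ordinary floating-point arithmetic is used without outward rounding, so the computed endpoints are not rigorous bounds. Only the Jacobian matrix is evaluated, so the constant term of $f_1$ plays no role.

```python
def imul(a, b):
    """Product of the intervals a = (lo, hi) and b = (lo, hi)."""
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def iscale(c, a):
    """Product of the real number c and the interval a."""
    return imul((c, c), a)

def interval_jacobian(x1, x2):
    """Elementwise interval evaluation of the Jacobian
    [[2*x1, -2*x2], [2*x2, 2*x1]] over the box (x1, x2)."""
    return [[iscale(2.0, x1), iscale(-2.0, x2)],
            [iscale(2.0, x2), iscale(2.0, x1)]]

J = interval_jacobian((-0.1, 0.1), (0.9, 1.1))
```

A rigorous implementation would round lower endpoints down and upper endpoints up, as an interval arithmetic library does.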


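DEFINITION 8.9 Suppose $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$, suppose $\boldsymbol{x} \subseteq D$ is an
interval $n$-vector, and suppose $\check{x} \in D$. Then an interval matrix $\boldsymbol{A}$ is said to
be a slope matrix for $F$ at $\check{x}$ over $\boldsymbol{x}$ if and only if, for each $x \in \boldsymbol{x}$, there is an
$A \in \boldsymbol{A}$ such that

$$F(x) - F(\check{x}) = A(x - \check{x}).$$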


Slope matrices can have narrower entries than Lipschitz matrices, so they 
may lead to results when Lipschitz matrices do not. However, slope matrices 
can be somewhat more complicated to compute than Lipschitz matrices, and 
they are trickier to use in processes that prove uniqueness. 

The general interval Newton method can now be stated as 


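DEFINITION 8.10 Suppose $F : D \subset \mathbb{R}^n \to \mathbb{R}^n$, suppose $\boldsymbol{x} \subseteq D$ is an
interval $n$-vector, and suppose that $\boldsymbol{A}$ is an interval matrix such that either

1. $\boldsymbol{A}$ is a slope matrix for $F$ at $\check{x}$ over $\boldsymbol{x}$, or
2. $\check{x} \in \boldsymbol{x}$, and $\boldsymbol{A}$ is a Lipschitz matrix for $F$ over $\boldsymbol{x}$.

Then a multivariate interval Newton operator for $F$ is any mapping $N(F, \boldsymbol{x}, \check{x})$
from the set of ordered pairs $(\boldsymbol{x}, \check{x})$ of interval $n$-vectors $\boldsymbol{x}$ and point $n$-vectors
$\check{x}$ to the set of interval $n$-vectors, such that

$$\tilde{\boldsymbol{x}} \leftarrow N(F, \boldsymbol{x}, \check{x}) = \check{x} + \boldsymbol{v}, \tag{8.59}$$

where $\boldsymbol{v} \in \mathbb{IR}^n$ is any box that bounds the solution set to the linear interval
system

$$\boldsymbol{A}\boldsymbol{v} = -F(\check{x}). \tag{8.60}$$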


REMARK 8.10 In implementations of interval Newton methods on computers,
the vector $F(\check{x})$ is evaluated using interval arithmetic, even though the
value sought is at a point. This is to take account of roundoff error, so the
results will be mathematically rigorous.


An immediate consequence of Definition 8.10 is 


PROPOSITION 8.2 

Suppose $F$ has continuous first-order partial derivatives, and $N(F, \boldsymbol{x}, \check{x})$ is
the image under an interval Newton method of the box $\boldsymbol{x}$. Then any solutions
$x^* \in \boldsymbol{x}$ of $F(x) = 0$ must also lie in $N(F, \boldsymbol{x}, \check{x})$.


PROOF The proof is a consequence of the definition of a Lipschitz matrix
or of a slope matrix, and is left as Exercise 17 on page 485 below.


A uniqueness theorem can be stated in general for any interval Newton 
operator. We have 


THEOREM 8.9 

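Suppose $F$ and $N(F, \boldsymbol{x}, \check{x})$ are as in Definition 8.10, suppose that
$N(F, \boldsymbol{x}, \check{x}) \subset \boldsymbol{x}$, suppose that $\boldsymbol{A}$ is chosen to be a Lipschitz matrix for $F$ over
$\boldsymbol{x}$, and suppose that there exists an $x \in \boldsymbol{x}$ such that $F(x) = 0$. Then there is no
other $y \in \boldsymbol{x}$ with $F(y) = 0$ (that is, $x$ is unique).


PROOF $N(F, \boldsymbol{x}, \check{x}) \subset \boldsymbol{x}$ implies that the interval enclosure $\boldsymbol{v}$ to the
solution set to (8.60) is bounded. That, in turn, implies that every $A \in \boldsymbol{A}$ is
nonsingular. That said, assume that there is a $y \in \boldsymbol{x}$, $y \ne x$, with $F(y) = 0$.
Then

$$0 = F(y) - F(x) = A(y - x) \quad \text{for some } A \in \boldsymbol{A},$$

since $\boldsymbol{A}$ is a Lipschitz matrix for $F$ over $\boldsymbol{x}$. However, since $y - x \ne 0$, this
contradicts the fact that $A$ must be nonsingular. Therefore, $x$ must be unique.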


We now study some specific techniques for bounding the solution set of 
(8.60). In the process of studying practical aspects, we will relate multivariate 
interval Newton methods to the Gauss-Seidel method, to the nonlinear Gauss— 
Seidel method, to the contraction mapping theorem, to the Brouwer fixed 
point theorem, and to the Kantorovich theorem. 


8.4.1 The Nonlinear Interval Gauss—Seidel Method 


The nonlinear interval Gauss–Seidel method can be viewed in two ways: we
can view it either as an interval version of the Newton–Gauss–Seidel method
(where we are solving a linear system of equations in $n$ variables) or as an
interval version of the Gauss–Seidel–Newton method (where we are solving $n$
nonlinear equations, each in one variable). Each of these views is advantageous
for revealing different properties of the method.


8.4.1.1 As Newton–Gauss–Seidel


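If we view the nonlinear interval Gauss–Seidel method as an interval version
of the Newton–Gauss–Seidel method, then we will use the interval Gauss–Seidel
method of Section 3.4.5 (starting on page 153) to bound the solution
set to (8.60). In particular, if we apply the iteration scheme (3.58) (the interval
version of the Gauss–Seidel method, on page 154) to the interval linear system
(8.60), assuming we precondition (8.60) to

$$(Y\boldsymbol{A})\boldsymbol{v} = -YF(\check{x}),$$

we obtain

$$\begin{aligned}
\boldsymbol{v}^{(0)} &\leftarrow \boldsymbol{x} - \check{x}, \\
\boldsymbol{v}_i^{(k+1)} &\leftarrow \frac{1}{(Y\boldsymbol{A})_{ii}}\Bigl\{-(YF(\check{x}))_i - \sum_{j=1}^{i-1}(Y\boldsymbol{A})_{ij}\boldsymbol{v}_j^{(k+1)} - \sum_{j=i+1}^{n}(Y\boldsymbol{A})_{ij}\boldsymbol{v}_j^{(k)}\Bigr\} \quad \text{for } i = 1, 2, \ldots, n.
\end{aligned} \tag{8.61}$$

For example, we can use an interval derivative matrix as a Lipschitz matrix.
In that case, let $\frac{\partial f_i}{\partial x_j}(\boldsymbol{x})$ denote an interval enclosure of the range of
the $j$-th partial derivative $\partial f_i / \partial x_j$ of $f_i$ over $\boldsymbol{x}$ (such as can be obtained by
evaluating an expression for $\partial f_i / \partial x_j$ with interval arithmetic), and denote by
$\boldsymbol{F}'(\boldsymbol{x})$ the corresponding matrix. Then, if we further replace $\boldsymbol{v}_j^{(k)}$ by
$\boldsymbol{x}_j^{(k)} - \check{x}_j^{(k)}$ in the iteration equation (8.61) and solve for $\boldsymbol{x}_i^{(k+1)}$, we obtain

$$\begin{aligned}
\boldsymbol{x}_i^{(k+1)} \leftarrow \check{x}_i^{(k)} - \frac{1}{(Y\boldsymbol{F}'(\boldsymbol{x}^{(k)}))_{ii}}\Bigl\{&(YF(\check{x}^{(k)}))_i + \sum_{j=1}^{i-1}(Y\boldsymbol{F}'(\boldsymbol{x}^{(k)}))_{ij}\,(\boldsymbol{x}_j^{(k+1)} - \check{x}_j^{(k)}) \\
&+ \sum_{j=i+1}^{n}(Y\boldsymbol{F}'(\boldsymbol{x}^{(k)}))_{ij}\,(\boldsymbol{x}_j^{(k)} - \check{x}_j^{(k)})\Bigr\} \quad \text{for } i = 1, 2, \ldots, n.
\end{aligned} \tag{8.62}$$

In (8.62), $\check{x}^{(k+1)}$ can be chosen to be the midpoint of $\boldsymbol{x}^{(k+1)}$, although other
choices are possible, and sometimes advisable.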


8.4.1.2. As a Univariate Method with Uncertainty 


The iteration (8.62) can also be derived with a multivariate version of the
mean value extension $\boldsymbol{f}_{mv}$ considered in Problem 26 on page 33. We have


PROPOSITION 8.3
Suppose $f_i : D \subset \mathbb{R}^n \to \mathbb{R}$, $\boldsymbol{x} \subseteq D$ is an interval $n$-vector, $f_i$ has continuous
partial derivatives, $\frac{\partial f_i}{\partial x_j}(\boldsymbol{x})$ denotes an interval enclosure for the range
of $\partial f_i / \partial x_j$ over $\boldsymbol{x}$, for $1 \le j \le n$, and $\check{x} \in \boldsymbol{x}$; further, define

$$\boldsymbol{f}_{i,mv}(\boldsymbol{x}) = f_i(\check{x}) + \sum_{j=1}^{n} \frac{\partial f_i}{\partial x_j}(\boldsymbol{x})\,(\boldsymbol{x}_j - \check{x}_j). \tag{8.63}$$

Then $\boldsymbol{f}_{i,mv}(\boldsymbol{x})$ contains the range of $f_i$ over $\boldsymbol{x}$.


PROOF The proof follows directly from the multivariate mean value
theorem (Theorem 8.1 on page 441); you will fill in the details in Exercise 20
on page 485.
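
A mean value extension of this kind can be evaluated directly from its definition: the function value at a base point plus a sum of products of derivative enclosures with the shifted box. The sketch below is illustrative only. The interval helpers, the `(lo, hi)` tuple representation, and the function names are our assumptions, and floating-point rounding is ignored, so the result is not a rigorous enclosure. As the test function we take $f(x_1, x_2) = 2x_1x_2$ over the box $([-0.1, 0.1], [0.9, 1.1])^T$ with base point $(0, 1)^T$.

```python
def iadd(a, b):
    """Sum of the intervals a = (lo, hi) and b = (lo, hi)."""
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    """Product of the intervals a = (lo, hi) and b = (lo, hi)."""
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def mean_value_extension(f, grad_enclosure, box, base):
    """f(base) + sum_j grad_j(box) * (box_j - base_j), as an interval."""
    fb = f(base)
    total = (fb, fb)
    for g, xj, cj in zip(grad_enclosure(box), box, base):
        total = iadd(total, imul(g, (xj[0] - cj, xj[1] - cj)))
    return total

def f2(p):
    return 2.0 * p[0] * p[1]

def grad_f2(box):
    """Interval enclosures of (df/dx1, df/dx2) = (2*x2, 2*x1) over the box."""
    two = (2.0, 2.0)
    return [imul(two, box[1]), imul(two, box[0])]

box = [(-0.1, 0.1), (0.9, 1.1)]
enclosure = mean_value_extension(f2, grad_f2, box, base=[0.0, 1.0])
```

The exact range of $2x_1x_2$ over this box is $[-0.22, 0.22]$, so the computed enclosure (approximately $[-0.24, 0.24]$) contains the range with some overestimation, as the proposition allows.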


REMARK 8.11 $\boldsymbol{f}_{i,mv}$ is called a multivariate mean value interval
extension of $f_i$.


REMARK 8.12 Slope matrices, as in Definition 8.9, can be used instead
of interval enclosures to partial derivatives. In that case, $\check{x}$ need not necessarily
be in $\boldsymbol{x}$.


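With the multivariate mean value extension, we can consider $f_i$ (or, for
preconditioned systems, $(YF)_i$) to be a function of $x_i$, with uncertainty in
its values introduced by the variables $x_j$, $j \ne i$. Base the multivariate mean
value expansion about the point

$$\hat{x} = (\check{x}_1, \ldots, \check{x}_{i-1}, t, \check{x}_{i+1}, \ldots, \check{x}_n)^T$$

so that the mean value extension becomes

$$\begin{aligned}
\varphi_i(t) &= (YF)_i(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_n) \\
&\in (YF)_i(\hat{x}) + \sum_{j \ne i} \frac{\partial (YF)_i}{\partial x_j}(\boldsymbol{x})\,(\boldsymbol{x}_j - \check{x}_j) \\
&= (YF)_i(\check{x}_1, \ldots, \check{x}_{i-1}, t, \check{x}_{i+1}, \ldots, \check{x}_n) + \boldsymbol{I},
\end{aligned} \tag{8.64}$$

where $\hat{x} = (\check{x}_1, \ldots, \check{x}_{i-1}, t, \check{x}_{i+1}, \ldots, \check{x}_n)^T$ and where we interpret

$$\boldsymbol{I} = \sum_{j \ne i} \frac{\partial (YF)_i}{\partial x_j}(\boldsymbol{x})\,(\boldsymbol{x}_j - \check{x}_j)$$

as an interval of uncertainty that is “constant” with respect to the variable $t$.
Then, identifying $\varphi_i$ in (8.64) with $f$ and identifying $t$ with $x$ in the derivation
of the univariate interval Newton method (Equations (2.10) and (2.11) on
page 54), we obtain the interval Gauss–Seidel method as in Equation (8.62).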
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8.4.1.3 Existence and Uniqueness Verification Theory 


Existence and uniqueness theory for the interval Gauss-Seidel method and 
other interval Newton methods is covered in detail in [62]. We presented⁶ 
a relatively simple proof of the existence-proving properties of the interval 
Gauss-Seidel method in [44]. We present this proof here, since it is based on 
Miranda’s theorem, which is useful on its own. 


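THEOREM 8.10 
(Miranda’s Theorem) Suppose 

\[ \boldsymbol{x} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n)^T = ([\underline{x}_1, \overline{x}_1], [\underline{x}_2, \overline{x}_2], \ldots, [\underline{x}_n, \overline{x}_n])^T \in \mathbb{IR}^n \]

is an interval $n$-vector, and define the faces of $\boldsymbol{x}$ by 

\[
\begin{aligned}
\boldsymbol{x}_{\underline{i}} &= (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_{i-1}, \underline{x}_i, \boldsymbol{x}_{i+1}, \ldots, \boldsymbol{x}_n)^T, \\
\boldsymbol{x}_{\overline{i}} &= (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_{i-1}, \overline{x}_i, \boldsymbol{x}_{i+1}, \ldots, \boldsymbol{x}_n)^T.
\end{aligned}
\]

Further, let $F = (f_1, \ldots, f_n)^T : \boldsymbol{x} \to \mathbb{R}^n$ be continuous, and denote by $f_i^u(\boldsymbol{y})$ 
the range of $f_i$ over an interval vector $\boldsymbol{y} \in \mathbb{IR}^n$. If 

\[ f_i^u(\boldsymbol{x}_{\underline{i}}) \, f_i^u(\boldsymbol{x}_{\overline{i}}) \le 0 \quad \text{for each } i, \ 1 \le i \le n, \]

then there is an $x \in \boldsymbol{x}$ such that $F(x) = 0$. 


Miranda’s theorem is a consequence of the Brouwer fixed point theorem, 
which we state on page 471 later.⁷ 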


We now state our existence theorem for the interval Gauss-Seidel method. 


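THEOREM 8.11 

Suppose $\tilde{\boldsymbol{x}}^{(1)}$ is defined through (8.62) (page 466) with $k = 0$, suppose that $Y$ 
is nonsingular, and suppose that $\tilde{\boldsymbol{x}}^{(1)} \subset \boldsymbol{x}^{(0)}$. Then there is an $x \in \boldsymbol{x}^{(0)}$ such 
that $F(x) = 0$. 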


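⁶ We do not claim that this proof originated with us; these existence and uniqueness results 
are ubiquitous in the literature on interval Newton methods. 

⁷ The original presentation of Miranda’s theorem is in C. Miranda, “Un’osservazione su un 
teorema di Brouwer,” Boll. Un. Mat. Ital., Series 2, pp. 5-7, 1940. 

PROOF Using the multivariate mean value theorem and along the lines 
of (8.64), we have, with $\boldsymbol{x} = \boldsymbol{x}^{(0)}$, 

\[
(YF)_1(x_{\underline{1}}) = (YF)_1(\check{x}_1, \ldots, \check{x}_n) + \sum_{j=2}^{n} \frac{\partial (YF)_1}{\partial x_j}(c) (x_j - \check{x}_j) + \frac{\partial (YF)_1}{\partial x_1}(c) (\underline{x}_1 - \check{x}_1),
\tag{8.65}
\]

where $c$ is some point on the line between $x_{\underline{1}} = (\underline{x}_1, x_2, \ldots, x_n)^T$ and $(\check{x}_1, \ldots, \check{x}_n)^T$, 
and 

\[
(YF)_1(x_{\overline{1}}) = (YF)_1(\check{x}_1, \ldots, \check{x}_n) + \sum_{j=2}^{n} \frac{\partial (YF)_1}{\partial x_j}(\tilde{c}) (x_j - \check{x}_j) + \frac{\partial (YF)_1}{\partial x_1}(\tilde{c}) (\overline{x}_1 - \check{x}_1),
\tag{8.66}
\]

where $\tilde{c}$ is some point on the line between $x_{\overline{1}} = (\overline{x}_1, x_2, \ldots, x_n)^T$ and $(\check{x}_1, \ldots, \check{x}_n)^T$. 
Now, using the notation in the statement of Miranda’s theorem, 

\[ (YF)_1^u(\boldsymbol{x}_{\underline{1}}) \, (YF)_1^u(\boldsymbol{x}_{\overline{1}}) \le 0 \tag{8.67} \]

if and only if 

\[ (YF)_1(x_{\underline{1}}) \, (YF)_1(x_{\overline{1}}) \le 0 \tag{8.68} \]

for every $x_{\underline{1}} \in \boldsymbol{x}_{\underline{1}}$ and every $x_{\overline{1}} \in \boldsymbol{x}_{\overline{1}}$. 

Now observe that, in (8.65), $\frac{\partial (YF)_1}{\partial x_1}(c) \ne 0$, since the denominator in 
the first step of the interval Gauss-Seidel method (8.62) is $\frac{\partial (YF)_1}{\partial x_1}(\boldsymbol{x})$, 
which contains $\frac{\partial (YF)_1}{\partial x_1}(c)$ by the fundamental theorem of interval arithmetic, 
and since there cannot be a zero in the denominator of (8.62) if the 
result is a bounded interval. Therefore, 

\[ \text{either} \quad \frac{\partial (YF)_1}{\partial x_1}(\boldsymbol{x}) < 0 \quad \text{or} \quad \frac{\partial (YF)_1}{\partial x_1}(\boldsymbol{x}) > 0. \]

If $\frac{\partial (YF)_1}{\partial x_1}(\boldsymbol{x}) > 0$, then solving $(YF)_1(x_{\underline{1}}) < 0$ for $\underline{x}_1$ in (8.65) gives 

\[
\underline{x}_1 < \check{x}_1 - \frac{1}{\frac{\partial (YF)_1}{\partial x_1}(c)} \Bigl\{ (YF)_1(\check{x}_1, \ldots, \check{x}_n) + \sum_{j=2}^{n} \frac{\partial (YF)_1}{\partial x_j}(c) (x_j - \check{x}_j) \Bigr\}.
\tag{8.69}
\]

However, the right member of (8.69) is contained in $\tilde{\boldsymbol{x}}_1^{(1)}$, and the lower bound 
of $\tilde{\boldsymbol{x}}_1^{(1)}$ is greater than $\underline{x}_1$ by the hypothesis to the theorem; therefore, (8.69) 
is true, and $(YF)_1^u(\boldsymbol{x}_{\underline{1}}) < 0$. Similarly, in the same case $\frac{\partial (YF)_1}{\partial x_1}(\boldsymbol{x}) > 0$, 
we can use (8.66), the fundamental theorem of interval arithmetic, and the 
hypothesis of the theorem to show that $(YF)_1^u(\boldsymbol{x}_{\overline{1}}) > 0$. Thus, the part 
of the hypothesis to Miranda’s theorem corresponding to $i = 1$ holds when 
$\frac{\partial (YF)_1}{\partial x_1}(\boldsymbol{x}) > 0$. 

A similar argument holds for $i = 1$ when $\frac{\partial (YF)_1}{\partial x_1}(\boldsymbol{x}) < 0$. You will 
fill in the details in Exercise 22 on page 485. 

For $i = 2$, the box $\boldsymbol{x}$ must be replaced by a smaller box contained in 
$\boldsymbol{x}$, and the above argument holds over this sub-box. We then repeat the 
process for $i = 3, \ldots, n$. Since the hypotheses of Miranda’s theorem then 
hold for $YF$ over some sub-box of $\boldsymbol{x}$, it follows that there is a solution of 
$(YF)(x) = 0$ in this sub-box, and hence in $\boldsymbol{x}$. 

To complete the proof, we observe that, if $Y$ is nonsingular, then the only 
solution to $Yv = 0$ is $v = 0$; therefore, if $YF = 0$, it follows that $F = 0$. 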


REMARK 8.13 More powerful existence results can be proven for the 
interval Gauss-Seidel method; see, for example, the results in [62]. However, 
we have presented the above theorem since it is relatively easy to prove, gives 
a result that is useful in practice, and makes the main idea easy to see. 


8.4.2 The Multivariate Krawczyk Method 


The multivariate Krawczyk method is a tight analogue of the univariate 
Krawczyk method, introduced in Problem 29 on page 81. In particular, we 
set 

\[ G(x) = x - YF(x), \tag{8.70} \]

where $Y$ is an approximation to $(F'(\check{x}))^{-1}$, for some point $\check{x}$ near $x$. In matrix 
form, 

\[ G'(x) = I - YF'(x). \tag{8.71} \]

(See Problem 23 on page 485.) Applying our multivariate mean value theorem 
(Theorem 8.1 on page 441) to (8.70), then replacing the matrix $A$ in (8.1) by 
a matrix $F'(\boldsymbol{x})$ whose entries represent interval enclosures for the range of 
corresponding entries of $F'$ over $\boldsymbol{x}$, we thus obtain, for $x \in \boldsymbol{x}$, 

\[
\begin{aligned}
G(x) &\in G(\check{x}) + (I - YF'(\boldsymbol{x}))(x - \check{x}) \\
&\subseteq \check{x} - YF(\check{x}) + (I - YF'(\boldsymbol{x}))(\boldsymbol{x} - \check{x}).
\end{aligned}
\tag{8.72}
\]


This leads us to 


DEFINITION 8.11 (The multivariate Krawczyk operator) The iteration 

\[ \boldsymbol{x}^{(k+1)} = K(F, \boldsymbol{x}^{(k)}, \check{x}^{(k)}) = \check{x}^{(k)} - YF(\check{x}^{(k)}) + (I - YF'(\boldsymbol{x}^{(k)}))(\boldsymbol{x}^{(k)} - \check{x}^{(k)}) \tag{8.73} \]

is called the Krawczyk method, where the operator $K(F, \boldsymbol{x}^{(k)}, \check{x}^{(k)})$ is called 
the Krawczyk operator. 


REMARK 8.14 The point $\check{x}^{(k)}$ is often chosen to be the vector of midpoints 
of components of $\boldsymbol{x}^{(k)}$, while $Y$ is often chosen to be the inverse of 
the matrix whose entries are the midpoints of corresponding entries of $F'(\boldsymbol{x}^{(k)})$. 
These choices endow the Krawczyk method with certain symmetries and nice 
theoretical properties. 


We have 


THEOREM 8.12 
Suppose $K(F, \boldsymbol{x}^{(k)}, \check{x}^{(k)}) \subset \boldsymbol{x}^{(k)}$ and $Y$ is nonsingular. Then there exists an 
$x \in \boldsymbol{x}^{(k)}$ such that $F(x) = 0$. 


REMARK 8.15 The condition that $Y$ be nonsingular can be removed, but the proof of the 
theorem is slightly simpler if we make this assumption. 


This theorem is most easily proven with 


THEOREM 8.13 

(The Brouwer fixed point theorem) Suppose $D$ is a closed, bounded, convex 
set, suppose $G : D \to \mathbb{R}^n$, where $G$ is continuous, and suppose $G(x) \in D$ for 
every $x \in D$. Then there is an $x^* \in D$ such that $G(x^*) = x^*$, that is, $G$ has 
a fixed point in $D$. 


REMARK 8.16 With introduction of additional terminology from al- 
gebraic topology, the Brouwer fixed point theorem can be stated somewhat 
more generally, but this statement is sufficient for our purposes. 


REMARK 8.17 The Brouwer fixed point theorem is due to L. E. J. Brouwer, 
one of the fathers of algebraic topology, in 1909, but is now common knowledge 
and is widely used among mathematical economists, analysts, etc. 


REMARK 8.18 The Brouwer fixed point theorem is a partial strengthening 
of the contraction mapping theorem (Theorem 8.2 on page 442). In 
particular, one of the assumptions in the contraction mapping theorem is 
that $G$ map $D$ into itself. That is the only assumption in the Brouwer fixed 
point theorem. However, the conclusion of the Brouwer fixed point theorem 
is somewhat weaker, since the Brouwer fixed point theorem doesn’t mention 
an iteratively defined sequence that converges to the fixed point, nor does it 
claim that the fixed point is unique. 


Now, we prove Theorem 8.12. 


PROOF Equation (8.72), the definition of $K(F, \boldsymbol{x}^{(k)}, \check{x}^{(k)})$, and the fundamental 
theorem of interval arithmetic (Theorem 1.9 on page 26) imply 
$G(x) \in K(F, \boldsymbol{x}^{(k)}, \check{x}^{(k)})$ for every $x \in \boldsymbol{x}^{(k)}$. Combining this with the hypothesis 
of Theorem 8.12 then gives $G(x) \in \boldsymbol{x}^{(k)}$ for $x \in \boldsymbol{x}^{(k)}$. The Brouwer fixed point theorem 
therefore implies that $G$ has a fixed point $x^* \in \boldsymbol{x}^{(k)}$, $G(x^*) = x^*$. However, 
by (8.70), this implies that $YF(x^*) = 0$. The conclusion of Theorem 8.12 now follows 
from the nonsingularity of $Y$. 


The condition $K(F, \boldsymbol{x}^{(k)}, \check{x}^{(k)}) \subset \boldsymbol{x}^{(k)}$ can be combined with other conditions, 
such as $\|I - YF'(\boldsymbol{x}^{(k)})\| \le r < 1$, to imply that Krawczyk iteration converges 
to a small interval bounding the unique fixed point.⁸ 
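
To make the existence test of Theorem 8.12 concrete, here is a minimal Python sketch that evaluates the Krawczyk operator (8.73) once for a two-variable system. The test system (the one appearing in Exercises 5 and 12), the box $([0.9, 1.1], [0.9, 1.1])^T$, and all function names are assumptions chosen for illustration. $Y$ is taken as the inverse of the point Jacobian at the midpoint. The interval arithmetic is plain floating point without directed rounding, so a production code would use a verified interval library instead.

```python
# Intervals are (lo, hi) tuples; no outward rounding, so this is only a sketch.
def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def isub(a, b):
    return (a[0] - b[1], a[1] - b[0])

def imul(a, b):
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def pt(c):
    return (c, c)

def F(x1, x2):
    return (x1 * x1 - 10 * x1 + x2 * x2 + 8, x1 * x2 * x2 + x1 - 10 * x2 + 8)

def jac_point(x1, x2):
    return [[2 * x1 - 10, 2 * x2], [x2 * x2 + 1, 2 * x1 * x2 - 10]]

def jac_interval(X1, X2):
    # Natural interval extension of the Jacobian entries over the box.
    return [[isub(imul(pt(2.0), X1), pt(10.0)), imul(pt(2.0), X2)],
            [iadd(imul(X2, X2), pt(1.0)),
             isub(imul(pt(2.0), imul(X1, X2)), pt(10.0))]]

def krawczyk(box):
    mid = [0.5 * (lo + hi) for lo, hi in box]
    J = jac_point(*mid)
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    Y = [[J[1][1] / det, -J[0][1] / det], [-J[1][0] / det, J[0][0] / det]]
    Fm = F(*mid)
    JX = jac_interval(*box)
    K = []
    for i in range(2):
        # Point part: mid_i - (Y F(mid))_i
        ki = pt(mid[i] - (Y[i][0] * Fm[0] + Y[i][1] * Fm[1]))
        for j in range(2):
            # Entry (i, j) of I - Y F'(box), times (box_j - mid_j)
            yf = iadd(imul(pt(Y[i][0]), JX[0][j]), imul(pt(Y[i][1]), JX[1][j]))
            c = isub(pt(1.0 if i == j else 0.0), yf)
            ki = iadd(ki, imul(c, (box[j][0] - mid[j], box[j][1] - mid[j])))
        K.append(ki)
    return K

box = [(0.9, 1.1), (0.9, 1.1)]
K = krawczyk(box)
inside = all(b[0] < k[0] and k[1] < b[1] for k, b in zip(K, box))
print(K, inside)
```

For this box the computed image lies strictly inside the box, which (up to the rounding caveat above) is the hypothesis of Theorem 8.12; the system indeed has the solution $(1, 1)^T$ there.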


8.4.3 Using Interval Gaussian Elimination 


Interval Gaussian elimination, as we saw in Section 3.3.7 (page 130), can 
be used to bound the solution set to the interval linear system (8.60) (on 
page 464) in the general multivariate interval Newton operator. Under cer- 
tain circumstances, interval Gaussian elimination gives better results than 
the Krawczyk method and the interval Gauss-Seidel method, but the interval 
Gauss-Seidel method gives better results in other situations. See A. Neu- 
maier, Interval Methods for Nonlinear Systems, Cambridge University Press, 
1990 for details. 


8.4.4 Relationship to the Kantorovich Theorem 


The Kantorovich theorem (Theorem 8.7 on page 454) has conclusions that 
are similar to the conclusions for the existence and uniqueness theorems for 
interval Newton methods. In particular, if a certain set is bounded inside 
another set, then existence of a solution within a particular region is assured. 
Several papers in the numerical analysis literature, such as [72] 
(1980) and [64], compare the two approaches. In the latter, it is shown that, if slopes are used instead of 
Lipschitz sets, the existence test based on the Krawczyk method gives results 
whenever the Kantorovich theorem does, and sometimes gives results when 
the Kantorovich theorem does not. 


⁸ For details, see A. Neumaier, Interval Methods for Nonlinear Systems, Cambridge University 
Press, 1990, R. E. Moore, “A Test for Existence of Solutions to Nonlinear Systems,” 
SIAM J. Numer. Anal. 14 (4), pp. 611-615 (September, 1977), etc. 


8.5 Quasi-Newton Methods (Broyden’s Method) 

Many point iterative methods for finding solutions to $F(x) = 0$ have the 
form 

\[
\begin{aligned}
p^{(k)} &= -H^{(k)} F(x^{(k)}), \\
x^{(k+1)} &= x^{(k)} + t^{(k)} p^{(k)} \quad \text{for } k = 0, 1, 2, \ldots,
\end{aligned}
\tag{8.74}
\]

where $H^{(k)}$ is an $n \times n$ matrix and $t^{(k)}$ is a scalar. For example, if $H^{(k)} = 
(F'(x^{(k)}))^{-1}$ and $t^{(k)} = 1$, (8.74) defines Newton’s method. 

However, $H^{(k)}$ may be chosen to satisfy certain conditions such that 
some properties of $H^{(k)}$ approximate those of $(F'(x^{(k)}))^{-1}$. Such methods, 
called quasi-Newton methods, are regarded as variations of Newton’s method. 
We will study a particular quasi-Newton method called Broyden’s method 
[25]. We will see that Broyden’s method reduces by an order of magnitude 
the $O(n^2)$ scalar function evaluations and the $O(n^3)$ operations involved in 
solving the linear system at each iteration of Newton’s method. 

We now derive Broyden’s method. We assume: 


(a) $F$ is continuously differentiable on an open set $D \subseteq \mathbb{R}^n$. 
(b) For given $x \in D$ and given $s \ne 0$, $x^+ = x + s \in D$. 


We associate $x$ with $x^{(k)}$ and $x^+$ with $x^{(k+1)}$, and we seek a good approximation 
to $F'(x^{(k+1)})$. Since $F'$ is continuous at $x^+$, given $\epsilon > 0$ there is a 
$\delta > 0$ such that 

\[ \|F(x) - F(x^+) - F'(x^+)(x - x^+)\| \le \epsilon \|x - x^+\|, \]

provided $\|x - x^+\| < \delta$. It follows that 

\[ F(x) \approx F(x^+) + F'(x^+)(x - x^+), \]

with the approximation improving as $\|x - x^+\|$ decreases. Hence, if $B^+$ is to 
denote an approximation to the matrix $F'(x^+)$, it is natural to require that $B^+$ 
satisfy 

\[ F(x) = F(x^+) + B^+(x - x^+), \]

that is, 

\[ B^+ s = y = F(x^+) - F(x), \quad \text{where } s = x^+ - x. \tag{8.75} \]

Equation (8.75), called the secant equation or quasi-Newton equation, is central 
to the development of quasi-Newton methods. 


REMARK 8.19 If $n = 1$, (8.75) completely determines $B^+$, and we are 
led to the secant method, i.e. 

\[ f'(x^{(k+1)}) \approx \frac{f(x^{(k+1)}) - f(x^{(k)})}{x^{(k+1)} - x^{(k)}} = B^+. \tag{8.76} \]

(You will show this in Exercise 24 on page 485.) 


For $n > 1$, the quasi-Newton equation deals with the approximate change 
in $F(x)$ in the direction $s = x^+ - x$. Now suppose that we have an approximation 
$B$ to $F'(x)$, i.e., $B \approx F'(x)$. Broyden assumed that $B^+ \approx F'(x^+)$ 
approximately produces the same effect as $B$ in any direction orthogonal to 
$s$. Thus, we assume that 

\[ B^+ z = Bz \quad \text{if } z^T s = 0. \tag{8.77} \]

It turns out that Equations (8.75) and (8.77) uniquely determine $B^+$ from $B$. 
To find $B^+$, consider the matrix 

\[ A = \frac{(y - Bs)s^T}{s^T s}, \quad \text{with } y = B^+ s \text{ from (8.75)}. \]

Let $v \in \mathbb{R}^n$. Then $v = z + \alpha s$ for some scalar $\alpha$, because 

\[ \mathbb{R}^n = \operatorname{span}(s, z_1, z_2, \ldots, z_{n-1}), \]

where $z_1, z_2, \ldots, z_{n-1}$ are orthogonal to $s$ and $z = \sum_{i=1}^{n-1} c_i z_i$. Then, 

\[
\begin{aligned}
Av = A(z + \alpha s) &= \frac{(y - Bs)s^T(z + \alpha s)}{s^T s} \\
&= \alpha (B^+ s - Bs) \quad \text{(since } s^T z = 0\text{)} \\
&= B^+(z + \alpha s) - B(z + \alpha s) \quad \text{(since } B^+ z = Bz \text{ by (8.77))} \\
&= (B^+ - B)v.
\end{aligned}
\]

Since $Av = (B^+ - B)v$ for all $v \in \mathbb{R}^n$, $A = B^+ - B$. Thus, 

\[ B^+ = B + \frac{(y - Bs)s^T}{s^T s}. \tag{8.78} \]

Equation (8.78) provides what is known as the Broyden update to the approximate 
Jacobian matrix. 

An alternative way of viewing the above construction of $B^+$ from $B$ is to 
view 

\[ B^+ = B + \frac{1}{s^T s} v s^T, \quad v \text{ an arbitrary vector,} \]

as a perturbation of $B$ by the rank-one matrix $vs^T/(s^T s)$ such that $B^+ s = 
Bs + v$, but $B^+ w = Bw$ for every $w$ with $s^T w = 0$. We then see that, for 
$B^+ s = y$, we must have $v = y - Bs$. 
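
The two defining properties of the Broyden update, the secant equation (8.75) and the invariance (8.77) on directions orthogonal to $s$, are easy to check numerically. The following Python sketch does so for a small hypothetical example; the matrix $B$, the vectors $s$, $y$, $z$, and the helper names are illustrative assumptions, not data from the text.

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def broyden_update(B, s, y):
    """Return B + (y - B s) s^T / (s^T s), the rank-one update (8.78)."""
    n = len(s)
    Bs = matvec(B, s)
    ss = sum(sj * sj for sj in s)
    return [[B[i][j] + (y[i] - Bs[i]) * s[j] / ss for j in range(n)]
            for i in range(n)]

# Hypothetical data for illustration.
B = [[2.0, 1.0], [0.0, 3.0]]
s = [1.0, 2.0]
y = [3.0, 1.0]
z = [2.0, -1.0]  # orthogonal to s: z^T s = 0

B_plus = broyden_update(B, s, y)

# Secant equation (8.75): B+ s - y should vanish.
secant_residual = [a - b for a, b in zip(matvec(B_plus, s), y)]
# Condition (8.77): B+ z - B z should vanish.
orthogonal_change = [a - b for a, b in zip(matvec(B_plus, z), matvec(B, z))]

print(B_plus, secant_residual, orthogonal_change)
```

Both residual vectors are zero up to rounding, as the derivation above predicts.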

Another way to see that (8.78) is a good choice for $B^+$ (subject to the 
condition that $B^+$ satisfies (8.75)), is that $B^+$ given by (8.78) is the “closest” 
matrix to $B$ in the Euclidean norm from all matrices that satisfy (8.75). This 
is stated in the following proposition. 


PROPOSITION 8.4 
Given $B$ an $n \times n$ matrix and $y \in \mathbb{R}^n$ and some nonzero $s \in \mathbb{R}^n$, define $B^+$ 
by (8.78). Then $B^+$ is the unique solution to the problem 

\[ \min\{\|\hat{B} - B\|_E : \hat{B}s = y\}, \quad \text{where } \|A\|_E = \Bigl( \sum_{i,j=1}^{n} |a_{ij}|^2 \Bigr)^{1/2}. \]


PROOF To show that $B^+$ is a solution, note that if $y = \hat{B}s$, then 

\[
\|B^+ - B\|_E = \Bigl\| \frac{(y - Bs)s^T}{s^T s} \Bigr\|_E = \Bigl\| (\hat{B} - B) \frac{ss^T}{s^T s} \Bigr\|_E \le \|\hat{B} - B\|_E \Bigl\| \frac{ss^T}{s^T s} \Bigr\|_E = \|\hat{B} - B\|_E
\tag{8.79}
\]

(from (8.78), defining the Broyden update). That $B^+$ is a unique solution follows 
from the following argument: Suppose that $B_1$ and $B_2$ are two solutions 
and $B_1 \ne B_2$, i.e., 

\[ \|B_1 - B\|_E \le \|\hat{B} - B\|_E \quad \text{and} \quad B_1 s = y, \]

and 

\[ \|B_2 - B\|_E \le \|\hat{B} - B\|_E \quad \text{and} \quad B_2 s = y \]

for every $\hat{B}$ that satisfies $\hat{B}s = y$. Let $B^* = \lambda B_1 + (1 - \lambda)B_2$, where $\lambda$ is any 
number with $0 < \lambda < 1$. Then 

\[ B^* s = y \]

and 

\[
\begin{aligned}
\|B^* - B\|_E &= \|\lambda(B_1 - B) + (1 - \lambda)(B_2 - B)\|_E \\
&< \lambda \|B_1 - B\|_E + (1 - \lambda)\|B_2 - B\|_E \le \|\hat{B} - B\|_E.
\end{aligned}
\tag{8.80}
\]

(The proof of the first strict inequality depends on the fact that $(B_1 - B) \ne 
\alpha(B_2 - B)$ for any scalar $\alpha$; see Remark 8.20 below.) Thus, $\|B^* - B\|_E < \|\hat{B} - B\|_E$, 
which is a contradiction when $\hat{B} = B^*$. Thus, $B^+$ is unique. 


REMARK 8.20 In general, if $A$ and $B$ are $n \times n$ ($n > 1$) nonzero matrices, 
$A \ne \alpha B$ for any scalar $\alpha$, and $0 < \lambda < 1$, then 

\[ \|\lambda A + (1 - \lambda)B\|_E < \lambda \|A\|_E + (1 - \lambda)\|B\|_E. \]

We say that the Euclidean norm is strictly convex. (This is to be shown in 
Exercise 25 on page 485.) 


We now consider how (8.75) and (8.78) can be used in an iterative method 
to solve $F(x) = 0$. The basic formulas for Broyden’s method are 

\[ x^{(k+1)} = x^{(k)} - B_k^{-1} F(x^{(k)}), \quad k = 0, 1, 2, \ldots \tag{8.81} \]

\[ y^{(k)} = F(x^{(k+1)}) - F(x^{(k)}), \quad s^{(k)} = x^{(k+1)} - x^{(k)} \tag{8.82} \]

\[ B_{k+1} = B_k + \frac{(y^{(k)} - B_k s^{(k)}) s^{(k)T}}{s^{(k)T} s^{(k)}} \tag{8.83} \]

It is clear that, given $x^{(0)}$ and $B_0$ (either $F'(x^{(0)})$ or a good approximation 
to $F'(x^{(0)})$), Broyden’s method can be carried out with $n$ scalar function 
evaluations per iteration, i.e., Broyden’s method requires only evaluation of 
$F(x^{(k+1)})$ (and no partial derivatives) in each step. However, it appears that 
we still need to solve a linear system $B_k s^{(k)} = -F(x^{(k)})$ at each iteration. 
We can overcome this difficulty by using a result of Sherman and Morrison. 
First, we need the following lemma. 


LEMMA 8.5 
Let $v, w \in \mathbb{R}^n$ be given. Then 

\[ \det(I + vw^T) = 1 + w^T v. \tag{8.84} \]


PROOF Let $P = I + vw^T$. If $v = 0$, the result is trivial, so assume that 
$v \ne 0$. Let $z$ be an eigenvector of $P$, i.e., $(I + vw^T)z = \lambda z$ for some $\lambda$; then 
$(1 - \lambda)z = -(w^T z)v$, i.e., $z$ is either orthogonal to $w$ or is a multiple of $v$. 
(Thus, $n - 1$ eigenvectors are orthogonal to $w$.) If $w^T z = 0$, then $\lambda = 1$, while 
if $z$ is parallel to $v$, $\lambda = 1 + w^T v$; thus, the eigenvalues of $P$ are all 1 except 
for a single eigenvalue equal to $1 + w^T v$. Thus, (8.84) follows from 

\[ \det(P) = \prod_{i=1}^{n} \lambda_i = 1 + w^T v. \]


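LEMMA 8.6 

(The Sherman-Morrison formula) Let $u, v \in \mathbb{R}^n$ and assume that the $n \times n$ 
matrix $A$ is nonsingular. Then $A + uv^T$ is nonsingular if and only if $\sigma = 
1 + v^T A^{-1} u \ne 0$. Moreover, if $\sigma \ne 0$, then 

\[ (A + uv^T)^{-1} = A^{-1} - \frac{1}{\sigma} A^{-1} u v^T A^{-1}. \tag{8.85} \]


PROOF Since $\det(A + uv^T) = \det(A) \det(I + A^{-1}uv^T)$ and $A$ is nonsingular, 
$A + uv^T$ is nonsingular if and only if $\det(I + A^{-1}uv^T) \ne 0$. By 
Lemma 8.5, $\det(I + A^{-1}uv^T) = 1 + v^T A^{-1} u = \sigma$. To verify (8.85), we need 
only show that if we multiply the right-hand side by $A + uv^T$ we get $I$. Thus, 

\[
\begin{aligned}
(A + uv^T)\Bigl(A^{-1} - \frac{1}{\sigma} A^{-1}uv^T A^{-1}\Bigr)
&= I + uv^T A^{-1} - \frac{1}{\sigma}\bigl(uv^T A^{-1} + uv^T A^{-1}uv^T A^{-1}\bigr) \\
&= I - \frac{1}{\sigma}\bigl[-\sigma uv^T A^{-1} + uv^T A^{-1} + u(v^T A^{-1}u)v^T A^{-1}\bigr] \\
&= I - \frac{1}{\sigma}\bigl[-uv^T A^{-1} - (v^T A^{-1}u)uv^T A^{-1} \\
&\qquad\qquad + uv^T A^{-1} + (v^T A^{-1}u)uv^T A^{-1}\bigr] \\
&= I.
\end{aligned}
\]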
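
A quick numerical check of Lemma 8.5 and Lemma 8.6 can be reassuring before the formula is used in an algorithm. The Python sketch below applies (8.85) to a hypothetical $2 \times 2$ example and compares the result with a direct inverse of $A + uv^T$; the matrix, the vectors, and the helper names are assumptions made for illustration.

```python
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    # Direct inverse of a 2x2 matrix, used only as a reference.
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def sherman_morrison(A_inv, u, v):
    """Return ((A + u v^T)^{-1}, sigma) from A^{-1}, following (8.85)."""
    n = len(u)
    Au = [sum(A_inv[i][j] * u[j] for j in range(n)) for i in range(n)]  # A^{-1} u
    vA = [sum(v[i] * A_inv[i][j] for i in range(n)) for j in range(n)]  # v^T A^{-1}
    sigma = 1.0 + sum(v[i] * Au[i] for i in range(n))
    if sigma == 0.0:
        raise ValueError("A + u v^T is singular")
    updated = [[A_inv[i][j] - Au[i] * vA[j] / sigma for j in range(n)]
               for i in range(n)]
    return updated, sigma

# Hypothetical data for illustration.
A = [[4.0, 1.0], [2.0, 3.0]]
u = [1.0, 2.0]
v = [3.0, 1.0]

A_plus = [[A[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]
sm_inverse, sigma = sherman_morrison(inv2(A), u, v)
direct_inverse = inv2(A_plus)
max_diff = max(abs(sm_inverse[i][j] - direct_inverse[i][j])
               for i in range(2) for j in range(2))
print(sigma, max_diff)
```

Here $\sigma = 1.9$, and $\det(A + uv^T) = 19 = \det(A)\,\sigma$, in agreement with Lemma 8.5; the two inverses agree to rounding error.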


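Now consider the iterative procedure (8.81)-(8.83) in light of Lemma 8.6. 
We have 

\[ B_{k+1} = B_k + \frac{(y^{(k)} - B_k s^{(k)}) s^{(k)T}}{s^{(k)T} s^{(k)}}, \]

which has the form 

\[ B_{k+1} = B_k + uv^T, \quad \text{where } u = \frac{y^{(k)} - B_k s^{(k)}}{s^{(k)T} s^{(k)}} \text{ and } v = s^{(k)}. \]

Thus, by Lemma 8.6 (the Sherman-Morrison formula), 

\[ B_{k+1}^{-1} = (B_k + uv^T)^{-1} = B_k^{-1} - \frac{1}{\sigma} B_k^{-1} u v^T B_k^{-1}, \quad \text{where } \sigma = 1 + v^T B_k^{-1} u. \]

Thus, 

\[ B_{k+1}^{-1} = B_k^{-1} - \frac{1}{1 + \dfrac{s^{(k)T}\bigl[B_k^{-1} y^{(k)} - s^{(k)}\bigr]}{s^{(k)T} s^{(k)}}} \cdot \frac{(B_k^{-1} y^{(k)} - s^{(k)}) s^{(k)T} B_k^{-1}}{s^{(k)T} s^{(k)}}. \]

Since the first denominator simplifies to $s^{(k)T} B_k^{-1} y^{(k)} / (s^{(k)T} s^{(k)})$, we have 

\[ B_{k+1}^{-1} = B_k^{-1} + \frac{(s^{(k)} - B_k^{-1} y^{(k)}) s^{(k)T} B_k^{-1}}{s^{(k)T} B_k^{-1} y^{(k)}}. \]

Letting $H_k = B_k^{-1}$ and $H_{k+1} = B_{k+1}^{-1}$, we have 

\[ H_{k+1} = H_k + \frac{(s^{(k)} - H_k y^{(k)}) s^{(k)T} H_k}{s^{(k)T} H_k y^{(k)}}. \tag{8.86} \]

Therefore, Broyden’s Method can be implemented in the following manner: 

\[ x^{(k+1)} = x^{(k)} - H_k F(x^{(k)}), \quad k = 0, 1, 2, \ldots \tag{8.87} \]

\[ y^{(k)} = F(x^{(k+1)}) - F(x^{(k)}), \quad s^{(k)} = x^{(k+1)} - x^{(k)} \tag{8.88} \]

\[ H_{k+1} = H_k + \frac{(s^{(k)} - H_k y^{(k)}) s^{(k)T} H_k}{s^{(k)T} H_k y^{(k)}} \tag{8.89} \]

Here, $x^{(0)}$ is an initial guess, and $H_0 = (F'(x^{(0)}))^{-1}$ or $H_0$ is a good approximation 
to $(F'(x^{(0)}))^{-1}$. In the above form, Broyden’s Method only requires 
$n$ scalar function evaluations, i.e., $F(x^{(k+1)})$, and $O(n^2)$ arithmetic operations 
per iteration, i.e., the matrix-vector multiplications involving $H_k$. 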
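
The iteration (8.87)-(8.89) translates almost line by line into code. The following Python sketch is one possible implementation using plain lists; the test system (the two-equation system of Exercises 5 and 12), the starting point $(0.5, 0.5)^T$, the tolerance, and all function names are assumptions chosen for illustration. $H_0$ is taken as the exact inverse Jacobian at the starting point.

```python
def broyden(F, x0, H0, tol=1e-12, max_iter=50):
    """Broyden's method in inverse-update form, following (8.87)-(8.89)."""
    n = len(x0)
    x = list(x0)
    H = [row[:] for row in H0]
    Fx = F(x)
    for k in range(max_iter):
        # (8.87): s = -H F(x), x_new = x + s
        s = [-sum(H[i][j] * Fx[j] for j in range(n)) for i in range(n)]
        x = [xi + si for xi, si in zip(x, s)]
        F_new = F(x)
        if max(abs(v) for v in F_new) < tol:
            return x, k + 1
        # (8.88): y = F(x_new) - F(x_old)
        y = [a - b for a, b in zip(F_new, Fx)]
        Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
        sH = [sum(s[i] * H[i][j] for i in range(n)) for j in range(n)]  # s^T H
        denom = sum(s[i] * Hy[i] for i in range(n))                     # s^T H y
        if denom == 0.0:
            raise ZeroDivisionError("s^T H y = 0; update (8.89) is undefined")
        # (8.89): H <- H + (s - H y)(s^T H) / (s^T H y)
        H = [[H[i][j] + (s[i] - Hy[i]) * sH[j] / denom for j in range(n)]
             for i in range(n)]
        Fx = F_new
    return x, max_iter

def F(x):
    # Illustrative test system with a solution at (1, 1).
    return [x[0] ** 2 - 10 * x[0] + x[1] ** 2 + 8,
            x[0] * x[1] ** 2 + x[0] - 10 * x[1] + 8]

def inv2(M):
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

x0 = [0.5, 0.5]
J0 = [[2 * x0[0] - 10, 2 * x0[1]],
      [x0[1] ** 2 + 1, 2 * x0[0] * x0[1] - 10]]  # Jacobian at x0
x_star, n_iter = broyden(F, x0, inv2(J0))
print(x_star, n_iter)
```

Each pass through the loop uses one evaluation of $F$ and a handful of matrix-vector products, which matches the operation counts stated above. The explicit check on $s^{(k)T} H_k y^{(k)}$ anticipates the breakdown discussed in Remark 8.21.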


REMARK 8.21 One problem with Broyden’s Method is that $B_{k+1}$ may 
be singular for some $k$; in such cases $s^{(k)T} H_k y^{(k)} = 0$ in (8.89). Broyden’s 
Method is then sometimes implemented in the following form rather than in 
the form (8.78): 

\[ B^+ = B + \theta \frac{(y - Bs)s^T}{s^T s}, \tag{8.90} \]

where $\theta$ is chosen to avoid a singular $B^+$. (Note that $\theta = 1$ in (8.78).) To 
avoid singular $B^+$, Lemma 8.5 and Formula (8.90) are used to yield 

\[ \det B^+ = \det(B) \times \Bigl[ (1 - \theta) + \theta \frac{s^T B^{-1} y}{s^T s} \Bigr] \tag{8.91} \]

(since $B^+ = B \Bigl[ I + \theta \dfrac{(B^{-1} y - s) s^T}{s^T s} \Bigr]$). 
Now $\theta$ is chosen as close to 1 as possible subject to $|\det B^+| \ge \sigma |\det(B)|$ 
for some specified $\sigma \in (0, 1)$. 

We have the following local convergence theorem for Broyden’s method, 
which we present without proof. 


THEOREM 8.14 

Let $F$ be continuously differentiable on an open convex set $D \subseteq \mathbb{R}^n$. Let there 
be an $x^* \in D$ such that $F(x^*) = 0$, and $F'(x^*)$ is nonsingular. Furthermore, 
suppose that there is a constant $\hat{c}$ such that 

\[ \|F'(x) - F'(x^*)\| \le \hat{c} \|x - x^*\| \quad \text{for } x \in D. \]

Then, Broyden’s Method is locally and superlinearly convergent to $x^*$. 


REMARK 8.22 Theorem 8.14 states that 

\[ \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|} \le \alpha_k, \]

where $\alpha_k \to 0$ as $k \to \infty$. Under the same hypotheses, Newton’s method is 
quadratically convergent, i.e., 

\[ \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^2} \le c \]

for $k$ sufficiently large. (See Proposition 8.1 on page 450.) 


REMARK 8.23 Although $x^{(k)} \to x^*$ superlinearly, it is not necessarily 
true that $B_k \to F'(x^*)$ as $k \to \infty$. (Indeed, Broyden’s Method is not self-correcting, 
i.e., $B_k$ may retain harmful information contained in $B_j$, $j < k$.) 
Consider 

\[ f_1(x_1, x_2) = x_1, \qquad f_2(x_1, x_2) = x_2 + x_2^3, \]

\[ F'(x) = \begin{pmatrix} 1 & 0 \\ 0 & 1 + 3x_2^2 \end{pmatrix}, \quad F'(x^*) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \text{and} \quad B_0 = \begin{pmatrix} 1 + \delta & 0 \\ 0 & 1 \end{pmatrix}. \]

However, the (1,1) element of $B_k$ can be shown to be $1 + \delta$ for all $k$; thus, 
$\{B_k\}$ does not converge to $F'(x^*)$. 
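
The behavior described in Remark 8.23 can be observed directly. The Python sketch below runs Broyden's method in the form (8.81)-(8.83) on the example of the remark. The starting point $x^{(0)} = (0, 0.5)^T$ and the value $\delta = 0.1$ are assumptions made here for illustration (the remark does not specify them); with a zero first component, every step $s^{(k)}$ has a zero first component, so the first row of $B_k$ is never modified.

```python
def F(x):
    return [x[0], x[1] + x[1] ** 3]

def solve2(B, r):
    # Solve the 2x2 system B s = r by Cramer's rule.
    d = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    return [(r[0] * B[1][1] - B[0][1] * r[1]) / d,
            (B[0][0] * r[1] - B[1][0] * r[0]) / d]

def broyden_B(x, B, tol=1e-12, max_iter=50):
    """Broyden's method with the direct update (8.83) of B_k."""
    Fx = F(x)
    for _ in range(max_iter):
        if max(abs(v) for v in Fx) < tol:
            break
        s = solve2(B, [-Fx[0], -Fx[1]])            # B_k s = -F(x_k)
        x = [x[0] + s[0], x[1] + s[1]]
        F_new = F(x)
        y = [F_new[0] - Fx[0], F_new[1] - Fx[1]]
        Bs = [B[i][0] * s[0] + B[i][1] * s[1] for i in range(2)]
        ss = s[0] ** 2 + s[1] ** 2
        B = [[B[i][j] + (y[i] - Bs[i]) * s[j] / ss for j in range(2)]
             for i in range(2)]
        Fx = F_new
    return x, B

delta = 0.1                       # assumed perturbation size
B0 = [[1.0 + delta, 0.0], [0.0, 1.0]]
x_final, B_final = broyden_B([0.0, 0.5], B0)
print(x_final, B_final)
```

The iterates converge to $x^* = (0, 0)^T$, yet the (1,1) entry of the final matrix is still $1 + \delta$ rather than 1.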


8.5.1 Practicalities 


Quasi-Newton methods were originally developed in an era when finding a 
solution to 10 nonlinear equations in 10 variables was a challenge, even for many 
systems that are almost trivial today, and when symbolic computation (computer 
algebra) was in its infancy. One of the original rationales was to avoid 
having to derive and program a Jacobian matrix. For many systems today, 
automatic differentiation (as introduced in Section 6.2, page 327) is practical, 
and is used, due to the superior convergence properties of Newton’s method; 
furthermore, automatic differentiation can compute matrix-vector products 
$F'(x)v$ in less than $O(n^2)$ operations. 

However, quasi-Newton methods with “secant updates” (updates to the 
matrix $B_k$ defined by the secant equation (8.75)) are still useful in modern 
scientific computation, such as for truly black box systems⁹ or for very large 
systems where a structure can be imposed on the $B_k$, etc. 


⁹ That is, for functions $F$ whose values are obtained by some process which cannot be analyzed. 
Such a process is called a “black box.” 


8.6 Methods for Finding All Solutions 


To this point, we have discussed iterative methods for finding approximations 
to a single solution to a nonlinear system of equations. In many applications, 
finding all solutions to a nonlinear system is required. Salient among 
the methods for doing so are homotopy methods and branch and bound methods. 


8.6.1 Homotopy Methods 


In a homotopy method, one starts with a simple function $g(x)$, $g : D \subseteq 
\mathbb{R}^n \to \mathbb{R}^n$ such that every point with $g(x) = 0$ is known, then transforms 
the function into the $f(x)$, $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$ for which all points satisfying 
$f(x) = 0$ are desired. During the process, one solves various intermediate 
systems, using the solution to the previous system in an initial guess for an 
iterative method for the next system. A typical such transformation is 

\[ H(x, t) = (1 - t) f(x) + t g(x), \tag{8.92} \]

so $H(x, 0) = f(x)$ and $H(x, 1) = g(x)$. One way of following the curves 
$H(x, t) = 0$ from $t = 0$ to $t = 1$ is to consider $y = (x, t) \in \mathbb{R}^{n+1}$, and to 
differentiate (8.92), obtaining 

\[ H'(y) y' = 0, \tag{8.93} \]

where $H'(y)$ is the $n$ by $n+1$ Jacobian matrix of $H$. If $H'$ is of full rank, $H'(y)$ 
has a one-dimensional null space, parallel to $y'$. At step $k$ of the method, one 
can take a vector $v_k$ in this direction, with $H'(y_k) v_k = 0$, say $\|v_k\| = 1$, and 
say $v_k^T v_{k-1} > 0$. One computes a predictor step 

\[ z_k = y_k + \kappa v_k, \tag{8.94} \]

then one corrects $z_k$ by iterating Newton’s method on the system 

\[ \begin{pmatrix} H(y) \\ N(y) \end{pmatrix} = 0, \tag{8.95} \]

where $N(y)$ is a normalization function. For example, if we want to correct 
in a direction perpendicular to the predictor step $v_k$, we could use 

\[ N(y) = v_k^T (y - z_k). \tag{8.96} \]

Similarly, if $t$ corresponds to the $(n+1)$-st coordinate of $y$ and we want to correct 
perpendicularly to that direction, we could set $N(y)$ to be the difference 
between the $(n+1)$-st coordinates of $y$ and $z_k$. Such a two-step approach (computing 
$z_k$ tangent to the curve, then correcting back onto the curve) is called 
a predictor-corrector method, not to be confused with a predictor-corrector 
method for differential equations. 
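
The predictor-corrector steps (8.94)-(8.96) can be sketched in a few lines for a scalar problem ($n = 1$, so $y = (x, t) \in \mathbb{R}^2$). In the Python sketch below, the functions $f(x) = x - 1$ and $g(x) = x^3 + x - 10$, the step length $\kappa = 0.1$, and all function names are hypothetical choices made for illustration. Following (8.92), the curve starts at the known zero $x = 1$ of $H(\cdot, 0) = f$ and is followed until $t \ge 1$; a final Newton iteration on $H(\cdot, 1) = g$ then polishes the result.

```python
import math

def f(x):
    return x - 1.0

def g(x):
    return x ** 3 + x - 10.0

def H(x, t):
    return (1.0 - t) * f(x) + t * g(x)

def dH(x, t):
    # Row vector H'(y) = (dH/dx, dH/dt).
    return ((1.0 - t) * 1.0 + t * (3.0 * x * x + 1.0), g(x) - f(x))

def tangent(x, t, prev):
    # Unit vector v with H'(y) v = 0, oriented so that v^T prev > 0.
    a, b = dH(x, t)
    norm = math.hypot(a, b)
    v = (-b / norm, a / norm)
    if v[0] * prev[0] + v[1] * prev[1] < 0:
        v = (-v[0], -v[1])
    return v

def correct(z, v, tol=1e-12, max_iter=20):
    # Newton's method on (H(y), N(y)) = 0 with N(y) = v^T (y - z), as in (8.95)-(8.96).
    x, t = z
    for _ in range(max_iter):
        r1 = H(x, t)
        r2 = v[0] * (x - z[0]) + v[1] * (t - z[1])
        if abs(r1) < tol and abs(r2) < tol:
            break
        a, b = dH(x, t)
        det = a * v[1] - b * v[0]
        dx = (-r1 * v[1] + b * r2) / det
        dt = (-a * r2 + r1 * v[0]) / det
        x, t = x + dx, t + dt
    return x, t

def follow(kappa=0.1, max_steps=1000):
    y = (1.0, 0.0)          # known zero of f at t = 0
    v_prev = (0.0, 1.0)     # orient the first tangent toward increasing t
    for _ in range(max_steps):
        if y[1] >= 1.0:
            break
        v = tangent(y[0], y[1], v_prev)
        z = (y[0] + kappa * v[0], y[1] + kappa * v[1])   # predictor (8.94)
        y = correct(z, v)                                # corrector
        v_prev = v
    x = y[0]
    for _ in range(20):     # final Newton polish on g(x) = H(x, 1)
        x -= g(x) / (3.0 * x * x + 1.0)
    return x

root = follow()
print(root)
```

For this choice of $f$ and $g$ the curve $H(x, t) = 0$ rises monotonically in $t$ from $(1, 0)$ to $(2, 1)$, so the procedure ends near the zero $x = 2$ of $g$. Curves with turning points are handled by the same code, since the normalization (8.96) does not require $t$ to increase at every step.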


In fact, however, (8.92) along with the normalization condition defines a 
derivative y’, so, in principle, methods and software for finding solutions to 
initial value problems for ordinary differential equations can be used to follow 
the curves of the homotopy. Indeed, this approach has been used. 


Determining an appropriate starting function g is a crucial part of a homo- 
topy method for finding all solutions to a system of equations. Particularly 
interesting is finding such g for polynomial systems of equations, where there 
is an interplay between numerical analysis and algebraic geometry. Signifi- 
cant results were obtained during the 1980’s; for example, see [59]. In such 
techniques, the homotopy is generally defined in a space derived from complex 
n-space, rather than real n-space. 


While finding all solutions to a nonlinear system with these techniques is 
called a homotopy method, an actual technique for following the solution 
curves of a system of equations $H(y) = 0$, where $H : \mathbb{R}^{n+1} \to \mathbb{R}^n$, is termed a 
continuation method. In addition to solving systems of nonlinear equations in 
homotopy methods, continuation methods are used to analyze parameterized 
systems of differential equations as the parameter (which we can view for now 
as the variable $t$) changes. In such systems, points along solution curves at 
which the Jacobian matrix $H'$ is not of full rank are of interest. Those points, 
where two or more solution curves can intersect, are termed bifurcation points. 
Bifurcation points are of physical significance in the models giving rise to 
these systems,¹⁰ and there is a rich mathematical theory for classification and 
analysis of bifurcation points. 


An introduction to continuation methods is [2], while an example of software 
is [98]. A relatively early reference on use of homotopy methods for solving 
polynomial systems is [60]; search the web for more recent work. 


8.6.2 Branch and Bound Methods 


Branch and bound methods, which we explain later in §9.6.3 in the context 
of optimization, can also be used to solve systems of nonlinear equations. In 
this context, the equations $F(x) = 0$ can be considered as constraints, and the 
objective function can be $\sum_{i=1}^{n} f_i^2(x)$, for example. See §9.6.3 for clarification. 


¹⁰ In fluid dynamics, biology, economics, etc. 


Exercises 


1. Use the univariate mean value theorem (which you can find stated as 
Theorem 1.4 on page 3) to prove the multivariate mean value theorem 
(stated as Theorem 8.1 on page 441). 


2. Write down the degree 2 Taylor polynomials for $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$, 
centered at $\check{x} = (\check{x}_1, \check{x}_2) = (0, 0)$, for $F$ as in Example 8.1. Lumping 
terms together in an appropriate way, interpret your values in terms of 
the Jacobian matrix and a second-derivative tensor. 


3. Show that if $F$ is Fréchet differentiable at $x$, then $F$ is continuous at $x$, 
i.e., prove that given $\epsilon > 0$ there is a $\delta > 0$ such that $\|F(x) - F(y)\| < \epsilon$ 
whenever $\|x - y\| < \delta$. 


4. Let $F$ be as in Example 8.1 (on page 441), and define 


0.0262 0.0767 


(a) Do several iterations of fixed point iteration, starting with initial 
guess $x^{(0)} = (8.0, 0.9)^T$. What do you observe? 


(b) Use Theorem 8.3 (on page 444) and Theorem 8.2 (the Contraction 
Mapping Theorem, page 442), if possible, to show that fixed 
point iteration will converge within a ball of radius 0.001 of $x = 
(-8.2005, -0.8855)^T$. Relate this to Theorem 8.4. 


5. The nonlinear system 

\[
\begin{aligned}
x_1^2 - 10x_1 + x_2^2 + 8 &= 0, \\
x_1 x_2^2 + x_1 - 10x_2 + 8 &= 0
\end{aligned}
\]

can be transformed into the fixed-point problem 

\[
\begin{aligned}
x_1 &= g_1(x_1, x_2) = \frac{x_1^2 + x_2^2 + 8}{10}, \\
x_2 &= g_2(x_1, x_2) = \frac{x_1 x_2^2 + x_1 + 8}{10}.
\end{aligned}
\]

Show that $G(x) = (g_1(x), g_2(x))^T$ has a unique fixed point in 

\[ D_0 = \{(x_1, x_2) \in \mathbb{R}^2 : 0 \le x_1, x_2 \le 1.5\}, \]

and that the fixed point iterations $x^{(k+1)} = G(x^{(k)})$ converge for any 
$x^{(0)} \in D_0$. 


6. Perform 4 iterations of the fixed-point method in problem 5 with initial 
vector $x^{(0)} = (0.5, 0.5)^T$. 


7. Let $B$ be an $n \times n$ real matrix with $\rho(B) < 1$ and define $G : \mathbb{R}^n \to \mathbb{R}^n$ 
by $G(x) = Bx + b$ with $b \in \mathbb{R}^n$. Show that $G$ has a unique fixed point 
$x^*$ and the iterates $x^{(k+1)} = G(x^{(k)})$ converge to $x^*$ for any $x^{(0)} \in \mathbb{R}^n$. 


k+ - k 
. Consider xt) = a ‘ = 0 Om cos(a) tis A 
vy —0.25 0.5 sin(«s"”) 2 


with 7 = ia 


(a) Prove that $\{x^{(k)}\}_{k=0}^{\infty}$ converges to a unique $x^* \in \mathbb{R}^2$. 


(b) Estimate the number of iterations required to achieve 

\[ \|x^{(k)} - x^*\|_\infty \le 0.001. \]


9. Univariate Newton iteration applied to find complex roots 

\[ f(x + iy) = u(x, y) + iv(x, y) = 0 \]

is equivalent to multivariate Newton iteration with functions 

\[ f_1(x, y) = u(x, y) = 0 \quad \text{and} \quad f_2(x, y) = v(x, y) = 0. \]

(a) Repeat Exercise 35 on page 82 in Section 2.8, except doing the 
iterations on the corresponding system $u(x, y) = 0$, $v(x, y) = 0$ of 
two equations in two unknowns. 

(b) Compare the results, number by number, to the results you obtained 
in Exercise 35 on page 82 in Section 2.8. 


10. Apply several iterations of Equation (8.37) (on page 453), for computing 
the inverse of the matrix 


rd 20 
A=) 1.2 41.1; 
Ciel 9 


with starting matrix 2) equal to the diagonal matrix whose diagonal 
entries are 1/2. 


(a) Do you observe quadratic convergence? 


(b) How many operations per iteration are needed if $A$ is a general $n$ 
by $n$ matrix? How many are needed if $A$ is an $n$ by $n$ tridiagonal 
matrix? 


(c) Do you see situations where this method of computing the inverse 
of a matrix might be more practical than Gaussian elimination? 


11. Show that the one-step SOR-Newton method has the form in Equation (8.58). 


12. Consider solving the nonlinear system 

\[
\begin{aligned}
x_1^2 - 10x_1 + x_2^2 + 8 &= 0, \\
x_1 x_2^2 + x_1 - 10x_2 + 8 &= 0.
\end{aligned}
\]

Perform 4 iterations of Newton’s method with the initial vector $x^{(0)} = 
(0.5, 0.5)^T$. 


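13. If $F(x) = (f_1(x), f_2(x), \ldots, f_n(x))^T$, where 

\[
\begin{aligned}
f_1(x) &= x_1 - a, \\
f_i(x) &= -x_{i-1} + 3x_i + e^{x_i} - 2x_{i+1} \quad \text{for } i = 2, 3, \ldots, n-1, \text{ and} \\
f_n(x) &= x_n - b,
\end{aligned}
\]

then $F(x) = 0$ has a unique solution $x^* \in \mathbb{R}^n$. Show that: 

(a) $F$ is continuously F-differentiable for all $x \in \mathbb{R}^n$. 
(b) $F$ is convex on all of $\mathbb{R}^n$. 
(c) $(F'(x))^{-1}$ exists for all $x \in \mathbb{R}^n$. 
(d) $(F'(x))^{-1} \ge 0$ for all $x \in \mathbb{R}^n$. 

Moreover, show that for any $x^{(0)} \in \mathbb{R}^n$, the Newton iterates converge to 
$x^*$. 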


14. Consider finding the minimum of 
f (v1, 22) = e™! +e”? — ano +2? +22 — 21 — 229 +4 


on $\mathbb{R}^2$. Prove that Newton’s method 

\[ x^{(k+1)} = x^{(k)} - \bigl(\nabla^2 f(x^{(k)})\bigr)^{-1} \nabla f(x^{(k)}) \]

converges to the unique minimum $x^* \in \mathbb{R}^2$ for any initial guess $x^{(0)} \in \mathbb{R}^2$. 


15. Show that, if $A$ is a Lipschitz matrix for $F$ over $\boldsymbol{x}$, then $A$ is a slope 
matrix for $F$ at $\check{x}$ over $\boldsymbol{x}$, for any $\check{x} \in \boldsymbol{x}$. 


16. Suppose $F : \boldsymbol{x} \to \mathbb{R}^n$, where $\boldsymbol{x}$ is an $n$-dimensional vector whose entries 
are intervals, and suppose we form an interval matrix $A$ such that the 
$(i, j)$-th entry of $A$ is an interval extension of the $j$-th partial derivative 
$\partial f_i/\partial x_j$ of $f_i$ over $\boldsymbol{x}$. Show that $A$ is a Lipschitz matrix for $F$ over $\boldsymbol{x}$. 


17. Prove Proposition 8.2 (on page 464). 


18. Let $F$ be as in Exercise 9 on page 483, let $\boldsymbol{x} = ([-0.1, 0.2], [0.8, 1.1])^T$, 
and $\check{x} = (0.05, 0.95)^T$. 


(a) Apply several iterations of the interval Gauss-Seidel method; interpret 
your results. 


(b) Apply several iterations of the Krawczyk method; interpret your 
results. 

(c) Apply several iterations of the interval Newton method you obtain 
by using the linear system solution bounder verifylss in INTLAB. 


19. Explain why, if $F'(x)$ is a Jacobian matrix for $F : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$ and 
$Y \in \mathbb{R}^{n \times n}$ is an $n$ by $n$ matrix, then $Y(F'(x))$ represents the Jacobian 
matrix for $YF$. 


20. Prove Proposition 8.3 on page 466. 


21. Fill in the details of the relationship, presented on page 467, of the 
interval Gauss-Seidel method to the univariate interval Newton method. 


22. Fill in the details of the proof of Theorem 8.11 (on page 468). 


23. By writing the quantities down in terms of sums of partial derivatives, 
show that $I - YF'(x)$ is the Jacobi matrix for $G(x) = x - YF(x)$, where 
$F : D \subseteq \mathbb{R}^n \to \mathbb{R}^n$. 


24. Show that if $n = 1$, the quasi-Newton equation (Equation (8.75) on 
page 473) reduces to Equation (8.76). 


25. Prove that, for $v \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$, $\|v + w\|_2 = \|v\|_2 + \|w\|_2$ if and 
only if $v = \alpha w$ for some positive scalar $\alpha$. Use this fact to show that 
the strict inequality in (8.80) on page 475 holds. 


26. Let $F$ be as in Exercise 9 on page 483. Do several iterations of Broyden’s 
method, using the same starting points as you did for Exercise 9; observe 
not only $x^{(k)}$, but also $B_k$. What do you observe? Do you observe 
superlinear convergence? 


. Suppose f(x) = (x — 2)(x+ 2), g(x) = x? — 5x24 4, and form H(y) : 
R? — R! according to (8.92) (on page 480). 


(a) Using « = 0.4 in (8.94) and using Newton’s method on (8.95), with 


normalization equation (8.96), follow the curve $H(y) = 0$ from
$t = 0$, $x = -2$ to $t = 1$.


(b) Draw a picture of the curve in $\mathbb{R}^2$, and draw your predictor steps
and corrector steps on that curve.


Chapter 9 


Optimization 


Classical optimization involves finding the minimum or maximum of a function
$\varphi: D \subseteq \mathbb{R}^n \to \mathbb{R}$ with respect to its argument $x \in \mathbb{R}^n$. The function $\varphi$ is called
the objective function. Sometimes, there are no restrictions on the argument
$x$, in which case the problem is said to be unconstrained. Often there are
side conditions, or constraints, expressed as inequalities and equations; the
problem is then said to be constrained. A general optimization problem can
be expressed as


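minimize $\varphi(x)$
subject to $c_i(x) = 0$, $i = 1, \ldots, m_1$,
$g_i(x) \le 0$, $i = 1, \ldots, m_2$, (9.1)


where $\varphi: D \to \mathbb{R}$ and $c_i, g_i: D \to \mathbb{R}$,
and $x = (x_1, \ldots, x_n)^T$.


(Thus, the problem is unconstrained if $m_1 = m_2 = 0$ and $D = \mathbb{R}^n$.) We will
write


$c(x): \mathbb{R}^n \to \mathbb{R}^{m_1} = (c_1(x), \ldots, c_{m_1}(x))^T$
and
$g(x): \mathbb{R}^n \to \mathbb{R}^{m_2} = (g_1(x), \ldots, g_{m_2}(x))^T$.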

Sometimes, some of the inequality constraints are simple, of the form $x_i \ge \underline{x}_i$,
$x_i \le \overline{x}_i$, that is, $x_i \in \boldsymbol{x}_i = [\underline{x}_i, \overline{x}_i]$. In such cases, the constraints are
called bound constraints. Algorithms for solving (9.1) often gain efficiency by 
treating bound constraints specially. 


REMARK 9.1 The optimization problem (9.1) is often called a nonlinear 
program. This term comes not from computer programming, but from the 
fact that optimization problems often come from operations research, where 
the solution to the problem provides managers with a program of production 
and distribution to follow. Along these lines, if $\varphi$ represents a total cost,
then the “objective” is to follow a program that minimizes the cost. If the
functions $\varphi$, $c_i$, and $g_i$ are linear, then the problem is called a linear program,
and the process of writing down and solving such problems is called linear 
programming. 


Two aspects of the solution to Problem (9.1) should be distinguished. 

DEFINITION 9.1 An optimum of Problem (9.1) is a value $\varphi^*$ taken on
by $\varphi$ at at least one point $x^*$ for which $c(x^*) = 0$ and $g(x^*) \le 0$, such that
$\varphi(x) \ge \varphi^*$ for every $x$ satisfying $c(x) = 0$ and $g(x) \le 0$. Every such point $x^*$
is called an optimizer or optimizing point for Problem (9.1).


The general optimization problem is very difficult to solve; in fact, it is
known to belong to a class of problems computer scientists call NP-complete.
If a problem belongs to the NP-complete class, then no algorithm is known
whose execution time is guaranteed to be $O(n^k)$ for every instance of the
problem (that is, for every choice of $\varphi$, $c$, and $g$). (If such a $k$ exists, the
algorithm is said to execute in polynomial time.) From a practical point of
view, there is no general algorithm that solves (9.1) in a practical amount of
time for every choice of $\varphi$, $c_i$, and $g_i$. Nonetheless, there are subclasses of
problem (9.1) for which general algorithms can be designed that will always finish
in a practical amount of time. Moreover, recently, sophisticated algorithms
have been developed for problem (9.1) that complete on present computers in
a practical amount of time, for many $\varphi$, $c$, and $g$ arising in applications.

An important subclass of optimization problems consists of those instances
of (9.1) in which the objective $\varphi$, the equality constraint functions $c$, and the
inequality constraint functions $g$ are all linear functions of the parameters $x$;
such optimization problems are called linear programs. We discuss solution
of linear programs in Section 9.4, starting on page 503.

Another important subclass of optimization problems, containing the class
of linear programs, consists of those in which $\varphi$, the $c_i$, and the $g_i$ are convex.
(See Definition 8.7 on page 455.) Such problems, termed convex optimization
problems, or convex programs, can be solved in polynomial time. (Sometimes,
problems that can be solved in a practical amount of time are called tractable.)

The set of $x \in D$ satisfying the constraints $c(x) = 0$ and $g(x) \le 0$ is
called the feasible set. Some practical problems have no objective function
(or, equivalently, can be considered to have a constant objective function). 
These problems are called constraint satisfaction problems. 

Optimization is a burgeoning field, with researchers and practitioners from 
many departments, such as mathematics, computer science, engineering de- 
partments, operations research and other business-oriented departments, and 
industrial laboratories. A thorough treatment of algorithms for all subclasses 
of problems and applications of practical interest is outside the scope of this 
text. However, we will cover certain basic principles and present the overall 
elements of some of the most prominent algorithms. 

We consider algorithms for finding the overall minimum of $\varphi$ over the entire
feasible set in Section 9.6, starting on page 518. A related problem, often
solved in practice because it is much easier, is the local optimization problem,
in which we merely seek a point $x$ such that $\varphi(y) \ge \varphi(x)$ for every $y$ in the
feasible set and in a sufficiently small ball in $\mathbb{R}^n$ about $x$. We discuss local
optimization in the next section. Several references for the material presented
in this chapter are [11], [32], [38], [54], [61], [65], [91], and [93].

9.1 Local Optimization


Local optimization corresponds to the concept of local convergence introduced
in the chapter on methods for nonlinear systems of equations. In particular,
we start with an initial guess, and use some iterative method to find
a point $x^*$ such that $x^*$ satisfies the constraints $c(x^*) = 0$ and $g(x^*) \le 0$, and
$\varphi(x^*)$ is the smallest value of $\varphi$ within some ball in $\mathbb{R}^n$ containing $x^*$. This is
in contrast to global optimization, in which we try to find the global optimum
of $\varphi$ over all $x$ within the domains of $\varphi$, $c$, and $g$.


9.1.1 Introduction to Unconstrained Local Optimization 
In the next two sections, we seek a local minimizer of a function $\varphi: D \subseteq
\mathbb{R}^n \to \mathbb{R}^1$. That is,


we seek $x^* \in D$ such that for some $\delta > 0$, $\varphi(x^*) \le \varphi(x)$ for all
$x \in D$ such that $\|x - x^*\| < \delta$, i.e., for all


$x \in S(x^*, \delta) = \{x \in D : \|x - x^*\| < \delta\}$. (9.2)


See Figure 9.1.


FIGURE 9.1: A local minimizer $x^*$ only minimizes over some $S$.


Example 9.1

$\varphi(x) = \varphi(x_1, x_2) = x_1^2 + x_2^2$ and $D = \{(x_1, x_2)^T \in \mathbb{R}^2 : -1 < x_1, x_2 < 1\}$.
For this example, $x^* = (0, 0)^T$ is a local minimizer. In this example, $x^*$ also
happens to be a global minimizer of $\varphi$, that is, $\varphi(x^*) \le \varphi(x)$ for all $x \in D$.

Iterative methods for finding an approximate local optimum of an objective
function $\varphi$ of $n$ variables often repeatedly find an optimum of a function of one
variable. Such univariate optimization, used in this context, is often called a
line search. We consider a particular line search method in the next section.


9.1.2 Golden Section Search 


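Consider the problem of finding the minimum of a scalar function $\varphi(x)$ on
the interval $[a, b]$. Assume that $\varphi(x)$ has a unique minimum at the point
$x^* \in (a, b)$. The golden section search procedure to find $x^*$ consists of the
following steps:


1. Set

   $x_0 \leftarrow a$,
   $x_1 \leftarrow a + (b - a)(1 - \alpha)$,
   $x_2 \leftarrow a + (b - a)\alpha$,
   $x_3 \leftarrow b$,

   for some $\alpha$, $1/2 < \alpha < 1$. (An optimal $\alpha$ is $\alpha = (-1 + \sqrt{5})/2$, as we will
   show later.)

2. DO WHILE ($|\hat{x}_3 - \hat{x}_0| > \epsilon$)

   (a) IF $\varphi(x_1) > \varphi(x_2)$, THEN $x^* \in (x_1, x_3)$, and

       (i) $\hat{x}_0 \leftarrow x_1$ and $\hat{x}_3 \leftarrow x_3$;
       (ii) $\hat{x}_2 \leftarrow \hat{x}_0 + (\hat{x}_3 - \hat{x}_0)\alpha$;
       (iii) $\hat{x}_1 \leftarrow \hat{x}_0 + (\hat{x}_3 - \hat{x}_0)(1 - \alpha)$.

       ELSE $x^* \in (x_0, x_2)$, and

       (i) $\hat{x}_0 \leftarrow x_0$ and $\hat{x}_3 \leftarrow x_2$;
       (ii) $\hat{x}_2 \leftarrow \hat{x}_0 + (\hat{x}_3 - \hat{x}_0)\alpha$;
       (iii) $\hat{x}_1 \leftarrow \hat{x}_0 + (\hat{x}_3 - \hat{x}_0)(1 - \alpha)$.

       END IF

   (b) IF $|\hat{x}_3 - \hat{x}_0| < \epsilon$ THEN
       ELSE

       $x_0 \leftarrow \hat{x}_0$, $x_1 \leftarrow \hat{x}_1$, $x_2 \leftarrow \hat{x}_2$, $x_3 \leftarrow \hat{x}_3$.

       END IF

   END DO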

REMARK 9.2 In Step 2(a) of the above procedure,


$\dfrac{x_1 - x_0}{x_3 - x_0} = 1 - \alpha$ and $\dfrac{x_2 - x_0}{x_3 - x_0} = \alpha$


for each iteration. In fact, we can find an $\alpha$ so that, simultaneously,
$\hat{x}_1 \leftarrow \hat{x}_0 + (\hat{x}_3 - \hat{x}_0)(1 - \alpha)$ in the third branch of step 2a can be replaced by
$\hat{x}_1 \leftarrow x_2$ and $\hat{x}_2 \leftarrow \hat{x}_0 + (\hat{x}_3 - \hat{x}_0)\alpha$ in the second branch of step 2a can be
replaced by $\hat{x}_2 \leftarrow x_1$. For this simplification, $\alpha$ must satisfy


$x_2 = x_1 + (x_3 - x_1)(1 - \alpha)$ and

$x_1 = x_0 + (x_2 - x_0)\alpha$.


Solving these equations for $\alpha$ (utilizing the relationships between $x_0$, $x_1$, $x_2$,
and $x_3$ defined in step 1) gives


$\alpha = \dfrac{-1 + \sqrt{5}}{2} \approx 0.618$.


(The number $1 + \alpha$, ubiquitous in classical elementary geometry, is sometimes
called the golden mean.) Using this $\alpha$, we only need to define one new point
on each iteration. The significance of this is that only one evaluation of $\varphi$ is
required per iteration.¹
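
The procedure and the simplification in Remark 9.2 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `golden_section_search`, the default tolerance, and the choice to return the midpoint of the final interval are assumptions made here, and the objective is assumed to have a unique minimum on the interval.

```python
import math


def golden_section_search(phi, a, b, eps=1e-8):
    """Approximate the minimizer of a unimodal function phi on [a, b]."""
    alpha = (math.sqrt(5.0) - 1.0) / 2.0  # alpha is about 0.618
    x0, x3 = a, b
    x1 = x0 + (x3 - x0) * (1.0 - alpha)
    x2 = x0 + (x3 - x0) * alpha
    f1, f2 = phi(x1), phi(x2)
    while abs(x3 - x0) > eps:
        if f1 > f2:
            # The minimizer lies in (x1, x3); the old x2 is reused as the new x1.
            x0, x1, f1 = x1, x2, f2
            x2 = x0 + (x3 - x0) * alpha
            f2 = phi(x2)  # the only new evaluation in this branch
        else:
            # The minimizer lies in (x0, x2); the old x1 is reused as the new x2.
            x3, x2, f2 = x2, x1, f1
            x1 = x0 + (x3 - x0) * (1.0 - alpha)
            f1 = phi(x1)  # the only new evaluation in this branch
    return 0.5 * (x0 + x3)


x_min = golden_section_search(lambda x: (x - 1.0) ** 2, 0.0, 3.0, eps=1e-6)
```

Each pass through the loop shrinks the bracketing interval by the factor $\alpha$ and calls the objective exactly once.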


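REMARK 9.3 If $[x_0^{(k)}, x_3^{(k)}]$ is the $k$-th interval in the golden section
procedure with $x_0^{(0)} = a$ and $x_3^{(0)} = b$, it is easy to show that $x_3^{(k)} - x_0^{(k)} =
\alpha^k (b - a)$, where $\alpha = (-1 + \sqrt{5})/2$. Thus, since $x^* \in [x_0^{(k)}, x_3^{(k)}]$, we have


$\left| x^* - \dfrac{x_0^{(k)} + x_3^{(k)}}{2} \right| \le \dfrac{1}{2} \alpha^k (b - a)$, for $k = 0, 1, 2, \ldots$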


9.1.3. Relationship to Nonlinear Systems 


We now return to consideration of problem (9.2) in the general $n$-dimensional
setting with $n \ge 1$. In the remainder of this section, we assume $\varphi$ is differentiable.
In this case, trying to find a zero of $\nabla\varphi$, the gradient of $\varphi$, is usually
part of the process for solving (9.2). This approach is based on the fact that
if $x^*$ is a local minimizer of $\varphi$ on an open set $D$ and $\varphi$ is differentiable at
$x^*$, then necessarily $\nabla\varphi(x^*) = 0$. Since $\nabla\varphi(x) = 0$ consists of $n$ equations in
the unknown components of $x$, we see that the minimization problem, when
$\varphi$ is differentiable, leads to finding the solutions to a system of $n$ nonlinear
equations in $n$ unknowns. (Points $x$ at which $\nabla\varphi(x) = 0$ are called critical
points of the unconstrained optimization problem.)


¹ In general, $\varphi$ may be quite complicated, so reducing the number of evaluations of $\varphi$
significantly improves the algorithm's efficiency.


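Example 9.2
Consider $\varphi(x) = x_1^2 + x_2^2$. Then


$\nabla\varphi(x) = \left( \dfrac{\partial\varphi}{\partial x_1}, \dfrac{\partial\varphi}{\partial x_2} \right)^T = (2x_1, 2x_2)^T$.


Thus, $\nabla\varphi(x) = 0$ when $x_1 = x_2 = 0$.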


We may therefore attempt application of the methods of the previous chapter
to finding the solution of $\nabla\varphi(x) = 0$. For instance, since $F = \nabla\varphi(x)$ is a
mapping from $\mathbb{R}^n$ to $\mathbb{R}^n$, we may use (when $\varphi$ is twice differentiable) Newton's
method, which here takes the form


$x^{(k+1)} = x^{(k)} - (\nabla^2\varphi(x^{(k)}))^{-1} \nabla\varphi(x^{(k)})$ for $k = 0, 1, 2, \ldots$ (9.3)


where $\nabla^2\varphi(x)$ is the Hessian matrix of $\varphi$ at $x$, i.e., $\nabla^2\varphi(x)$ is the Jacobian matrix
of $\nabla\varphi(x)$. Under appropriate conditions, we can obtain local and quadratic
convergence of (9.3) to a zero of $\nabla\varphi$.
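
As a small illustration of iteration (9.3), the following Python sketch applies Newton's method to the gradient of a function of two variables. The function name `newton_minimize`, the stopping test on the gradient, and the sample objective $\varphi(x) = e^{x_1} - x_1 + x_2^2$ (whose minimizer is the origin) are assumptions chosen for this illustration; the $2 \times 2$ linear system is solved by Cramer's rule to keep the sketch self-contained.

```python
import math


def newton_minimize(grad, hess, x, tol=1e-12, max_iter=50):
    """Iterate x <- x - H(x)^{-1} grad(x) for a function of two variables."""
    for _ in range(max_iter):
        g1, g2 = grad(x)
        if abs(g1) + abs(g2) < tol:
            break
        (a, b), (c, d) = hess(x)
        det = a * d - b * c
        # Solve the 2x2 system H p = g by Cramer's rule.
        p1 = (d * g1 - b * g2) / det
        p2 = (a * g2 - c * g1) / det
        x = (x[0] - p1, x[1] - p2)
    return x


# Hypothetical test objective: phi(x) = exp(x1) - x1 + x2**2.
def grad_phi(x):
    return (math.exp(x[0]) - 1.0, 2.0 * x[1])


def hess_phi(x):
    return ((math.exp(x[0]), 0.0), (0.0, 2.0))


x_star = newton_minimize(grad_phi, hess_phi, (1.0, 1.0))
```

Starting from $(1, 1)^T$, the iterates approach the critical point at the origin; the sketch does not check whether the critical point found is actually a minimizer.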


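REMARK 9.4 The Hessian matrix has the form


$$
\nabla^2\varphi(x) =
\begin{pmatrix}
\dfrac{\partial^2\varphi(x)}{\partial x_1^2} & \dfrac{\partial^2\varphi(x)}{\partial x_2 \partial x_1} & \cdots & \dfrac{\partial^2\varphi(x)}{\partial x_n \partial x_1} \\
\dfrac{\partial^2\varphi(x)}{\partial x_1 \partial x_2} & \dfrac{\partial^2\varphi(x)}{\partial x_2^2} & \cdots & \dfrac{\partial^2\varphi(x)}{\partial x_n \partial x_2} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial^2\varphi(x)}{\partial x_1 \partial x_n} & \dfrac{\partial^2\varphi(x)}{\partial x_2 \partial x_n} & \cdots & \dfrac{\partial^2\varphi(x)}{\partial x_n^2}
\end{pmatrix}
$$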


REMARK 9.5 Even though $\nabla\varphi(x) = 0$ when $x$ is a local minimizer of
$\varphi$, there are many places other than at local minimizers where $\nabla\varphi$ can equal
0. Thus, practical algorithms for minimization of $\varphi$ may need to include
procedures in addition to solving $\nabla\varphi = 0$.


REMARK 9.6 We can also transform the problem of finding the solution
to any nonlinear system of equations into a minimization problem. Suppose that we seek
$x$ such that $F(x) = (f_1(x), f_2(x), \ldots, f_n(x))^T = 0$. Define


$\varphi(x) = \sum_{i=1}^{n} (f_i(x))^2$.


Then every zero of $F$ is a global minimizer of $\varphi$ (with $\varphi = 0$ there), although
there may be local minima of $\varphi$ that do not correspond to zeros of $F$, and
$\varphi$ may have a minimum, even though $F$ has no zeros. (Notice that $F: D \subseteq
\mathbb{R}^n \to \mathbb{R}^n$ while $\varphi: D \subseteq \mathbb{R}^n \to \mathbb{R}^1$.)


We now briefly consider three other methods for numerically solving the 
minimization problem (9.2). First, we consider the method of steepest descent. 


9.1.4 Steepest Descent 


Recall that we seek $x^*$ such that $\varphi(x^*) \le \varphi(x)$ for all $x \in D$ such that
$\|x - x^*\| < \delta$.

In descent methods for finding $x^*$, we pick $x^{(0)}$, then try to find a direction $s$
such that $x^{(1)} = x^{(0)} + \lambda_0 s$ satisfies $\varphi(x^{(1)}) < \varphi(x^{(0)})$ for sufficiently small
$\lambda_0$. In general, a descent method generates a direction $s^{(k)}$ of local descent for
each iterate $x^{(k)}$, in the sense that there is a $\bar{\lambda}_k$ such that $\varphi(x^{(k)} + \lambda s^{(k)}) <
\varphi(x^{(k)})$ for $\lambda \in (0, \bar{\lambda}_k]$. The next iterate is of the form


$x^{(k+1)} = x^{(k)} + \lambda_k s^{(k)}$, (9.4)


where $\lambda_k$ is chosen (according to one of various strategies) to assure $\varphi(x^{(k+1)}) <
\varphi(x^{(k)})$. The direction $s^{(k)}$ and the parameters $\lambda_k$ should be chosen so the
sequence $\{\nabla\varphi(x^{(k)})\}$ converges to 0. If $\|\nabla\varphi(x^{(k)})\|$ is small, then usually $x^{(k)}$
is near a zero of $\nabla\varphi$, while the fact that the sequence $\{\varphi(x^{(k)})\}$ is decreasing
indicates that this zero of $\nabla\varphi$ is probably a local minimizer of $\varphi$.

The simplest example is the method of steepest descent, for which we ask
for a vector $\hat{s}$ of unit length with respect to the $L_2$-norm (i.e., $\|\hat{s}\|_2 = 1$)
such that the directional derivative $D_{\hat{s}}\varphi(x)$ is minimum over all directional
derivatives $D_s\varphi(x)$. Assuming that $\nabla\varphi(x) \ne 0$, $\hat{s} = -\nabla\varphi(x)/\|\nabla\varphi(x)\|_2$, since
the directional derivative of $\varphi$ is minimized in direction $-\nabla\varphi(x)$. Thus, the
method of steepest descent is given by


$x^{(k+1)} = x^{(k)} - \lambda_k \nabla\varphi(x^{(k)})$, $k = 0, 1, 2, \ldots$, (9.5)


where $\lambda_k$ is chosen to guarantee that $\varphi(x^{(k+1)}) < \varphi(x^{(k)})$.
The following result guarantees the existence of such a parameter. 


LEMMA 9.1

Let $\varphi: \mathbb{R}^n \to \mathbb{R}$ be defined on an open set $D$ and differentiable at $x$ in $D$.
If $[\nabla\varphi(x)]^T s < 0$ for some $s \in \mathbb{R}^n$, then there is a $\lambda^* = \lambda^*(x, s)$ such that
$\lambda^* > 0$ and $\varphi(x + \lambda s) < \varphi(x)$ for all $\lambda \in (0, \lambda^*)$.

PROOF The proof follows from the fact that


$\lim_{\lambda \to 0^+} \dfrac{\varphi(x + \lambda s) - \varphi(x)}{\lambda} = [\nabla\varphi(x)]^T s$. (9.6)


This lemma guarantees, in particular, that the parameter $\lambda_k$ in the steepest
descent method can be chosen such that $\varphi(x^{(k+1)}) < \varphi(x^{(k)})$. This is
not sufficient to show that $\{x^{(k)}\}$ approaches a zero of $\nabla\varphi$, since $\lambda_k$ may
be arbitrarily small. (For example, $\lambda_k$ might happen to be chosen so that
$\|x^{(k+1)} - x^{(k)}\| < \epsilon/2^k$, with the result that $\{x^{(k)}\}$ converges to a point $\tilde{x}$ such
that $\|\tilde{x} - x^{(0)}\| \le 2\epsilon$.)

We now turn to a selection of $\lambda_k$ in descent methods of the form


$x^{(k+1)} = x^{(k)} + \lambda_k s^{(k)}$, (9.7)

where $s^{(k)} = -\nabla\varphi(x^{(k)})$ for the steepest descent method. Consider

$x^{(k+1)} = x^{(k)} - \lambda_k \nabla\varphi(x^{(k)})$.

We wish to find $\lambda$ that minimizes the scalar function

$h(\lambda) = \varphi(x^{(k+1)}) = \varphi(x^{(k)} - \lambda \nabla\varphi(x^{(k)}))$


for $\lambda$ in an interval $[0, \lambda^*]$. One way would be to use the golden section search
procedure, assuming $\lambda^*$ is large enough to guarantee that a minimum of $h$ is
in the interval $[0, \lambda^*]$. Another way would involve differentiating $h$ and determining
the critical points of $h$ directly; this is generally too costly a procedure.
A third approach begins with selecting three nonnegative estimates $\lambda_{k_1}$, $\lambda_{k_2}$,
$\lambda_{k_3}$ to $\lambda_k$. Then, the quadratic polynomial interpolant to $h$ through $\lambda_{k_1}$, $\lambda_{k_2}$,
and $\lambda_{k_3}$ is calculated. Next, $\lambda_k$ is defined to be the number that minimizes
this quadratic polynomial. For example, if $\lambda_{k_1}$, $\lambda_{k_2}$, and $\lambda_{k_3}$ are set equal to
0, $\frac{1}{2}$, and 1, respectively, then


$p(\lambda) = g_1 + h_1 \lambda + h_3 \lambda \left( \lambda - \frac{1}{2} \right)$


interpolates $h(\lambda)$ at $\lambda = 0$, $\frac{1}{2}$, and 1, where


$g_1 = \varphi(x^{(k)})$,
$g_2 = \varphi\left( x^{(k)} - \frac{1}{2} \nabla\varphi(x^{(k)}) \right)$,
$g_3 = \varphi(x^{(k)} - \nabla\varphi(x^{(k)}))$,
$h_1 = 2(g_2 - g_1)$,
$h_2 = 2(g_3 - g_2)$, and
$h_3 = h_2 - h_1$.

Then, let

$g_\alpha = \varphi(x^{(k)} - \alpha \nabla\varphi(x^{(k)}))$,


where $\alpha = 1/4 - h_1/(2h_3)$ is the critical point of the polynomial $p(\lambda)$. Next,
$\lambda_k$ is selected from $\{\alpha, 0, \frac{1}{2}, 1\}$ such that

$\varphi(x^{(k)} - \lambda_k \nabla\varphi(x^{(k)})) = \min\{g_\alpha, g_1, g_2, g_3\}$.


Finally, $x^{(k+1)} = x^{(k)} - \lambda_k \nabla\varphi(x^{(k)})$.
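
The quadratic-interpolation choice of the step length can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `steepest_descent_step`, the guard that skips the critical point when $h_3 = 0$, and the sample objective $\varphi(x) = x_1^2 + 4x_2^2$ are choices made for this illustration and are not part of the method as described above.

```python
def steepest_descent_step(phi, grad, x):
    """One steepest descent step with the step length picked from
    {alpha, 0, 1/2, 1}, where alpha is the critical point of the
    quadratic interpolant of h at 0, 1/2, and 1."""
    g = grad(x)

    def h(lam):
        return phi([xi - lam * gi for xi, gi in zip(x, g)])

    g1, g2, g3 = h(0.0), h(0.5), h(1.0)
    h1 = 2.0 * (g2 - g1)
    h2 = 2.0 * (g3 - g2)
    h3 = h2 - h1
    candidates = [(g1, 0.0), (g2, 0.5), (g3, 1.0)]
    if h3 != 0.0:  # assumption: skip the critical point if it is undefined
        alpha = 0.25 - h1 / (2.0 * h3)
        candidates.append((h(alpha), alpha))
    _, lam = min(candidates)  # step length giving the smallest value of phi
    return [xi - lam * gi for xi, gi in zip(x, g)]


# Hypothetical test problem: phi(x) = x1**2 + 4*x2**2, minimized at the origin.
def phi(x):
    return x[0] ** 2 + 4.0 * x[1] ** 2


def grad(x):
    return [2.0 * x[0], 8.0 * x[1]]


x = [1.0, 1.0]
for _ in range(60):
    x = steepest_descent_step(phi, grad, x)
```

For this quadratic objective the interpolant reproduces $h$ exactly, so $\alpha$ is the exact minimizer along the search direction; for a general objective it is only an estimate.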


REMARK 9.7 There are many variations in the method of steepest descent.
In particular, intricate methods have been devised for determining $\lambda_k$.
However, in general, steepest descent methods are only linearly convergent,
but will converge independently of the starting approximation. Of course, the
methods may converge to a minimum that is not the absolute minimum of $\varphi$.


REMARK 9.8 Unlike Newton's method for nonlinear systems of equations,
the steepest descent method for nonlinear optimization is sensitive to
scaling, and is quite sensitive to the condition number of the problem. In
particular, convergence can be very slow if the ratio of the largest eigenvalue
to the smallest eigenvalue² of the Hessian matrix is large.


9.1.5 Quasi-Newton Methods for Minimization 


In quasi-Newton methods, the emphasis is on finding particular solutions to
the system of nonlinear equations $\nabla\varphi = 0$ that correspond to local minima of
$\varphi$. When we speak of quasi-Newton methods, we usually mean that we use an
iteration matrix constructed with the quasi-Newton equation (Equation 8.75
on page 473), rather than the Jacobian matrix of $F$. For the minimization
problem, $F = \nabla\varphi$, and the Jacobian matrix is the Hessian matrix of $\varphi$. We
call the formula for obtaining a new approximation to the Jacobian matrix
according to (8.75) an update. Special quasi-Newton updates (other than
Broyden's method, explained in §8.5) can be designed to appropriately approximate
Hessian matrices near local minimizers. We now describe one of
these.


9.1.5.1 The Davidon—Fletcher—Powell and BFGS Updates 


Recall that we seek $x^* \in D$ such that for some $\delta > 0$, $\varphi(x^*) \le \varphi(x)$ for all
$x \in D$ such that $\|x - x^*\| < \delta$, where $\varphi: D \subseteq \mathbb{R}^n \to \mathbb{R}^1$. If we use Newton's
method to solve this problem, we seek a solution $x^*$ of $\nabla\varphi(x^*) = 0$, and
Newton's method has the form

$x^{(k+1)} = x^{(k)} - (\nabla^2\varphi(x^{(k)}))^{-1} \nabla\varphi(x^{(k)})$, (9.8)


where $\nabla^2\varphi(x^{(k)})$ is the Hessian matrix. (Note that $\nabla^2\varphi(x)$ is symmetric
and positive definite at $x = x^*$.)

In quasi-Newton methods, we replace $\nabla^2\varphi(x^{(k)})$ by some approximation
$B_k$, thus obtaining

$x^{(k+1)} = x^{(k)} - B_k^{-1} \nabla\varphi(x^{(k)})$. (9.9)

One choice of quasi-Newton method for solving (9.9) would be Broyden's
method. However, since $\nabla^2\varphi(x^{(k)})$ is symmetric and positive definite (for
$x^{(k)}$ near $x^*$), it would be desirable to have these features in $B_k$.


² Each of which must be positive at a local minimum.

The Davidon-Fletcher-Powell (DFP) quasi-Newton update provides these
features, i.e., if $B_k$ is symmetric positive definite, then $B_{k+1}$ is symmetric
positive definite. This method has the form:


$x^{(k+1)} = x^{(k)} - B_k^{-1} \nabla\varphi(x^{(k)})$, (9.10)
$y = \nabla\varphi(x^{(k+1)}) - \nabla\varphi(x^{(k)})$, $s = x^{(k+1)} - x^{(k)}$, (9.11)
$B_{k+1} = \left( I - \dfrac{y s^T}{y^T s} \right) B_k \left( I - \dfrac{s y^T}{y^T s} \right) + \dfrac{y y^T}{y^T s}$. (9.12)


It can be shown, using similar techniques to those in the analysis of Broyden's
method, that, if


$H_{k+1} = H_k + \dfrac{s s^T}{s^T y} - \dfrac{H_k y y^T H_k}{y^T H_k y}$

and $H_k = B_k^{-1}$, then $H_{k+1} = B_{k+1}^{-1}$. Thus, the Davidon-Fletcher-Powell
method can be implemented as

$x^{(k+1)} = x^{(k)} - H_k \nabla\varphi(x^{(k)})$, (9.13)
$y = \nabla\varphi(x^{(k+1)}) - \nabla\varphi(x^{(k)})$, $s = x^{(k+1)} - x^{(k)}$, (9.14)
$H_{k+1} = H_k + \dfrac{s s^T}{s^T y} - \dfrac{H_k y y^T H_k}{y^T H_k y}$. (9.15)


Here, $x^{(0)}$ is an initial guess, and $H_0$ is either $(\nabla^2\varphi(x^{(0)}))^{-1}$ or a good
approximation to $(\nabla^2\varphi(x^{(0)}))^{-1}$.

There are also other, perhaps better, quasi-Newton updates for solving
minimization problems, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS)
update; see [25, p. 457]. The BFGS update is defined by

$H_{k+1} = H_k + \dfrac{(s - H_k y) s^T + s (s - H_k y)^T}{y^T s} - \dfrac{y^T (s - H_k y)\, s s^T}{(y^T s)^2}$, (9.16)

where $H_k$ is the $k$-th approximate inverse for the BFGS update, and where $y$
and $s$ are as in (9.14).

A classic reference for methods that incorporate quasi-Newton methods is
[24].
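
The two inverse updates (9.15) and (9.16) can be written compactly in code. The following Python sketch is illustrative only: the helper names `mat_vec`, `dot`, `dfp_update`, and `bfgs_update` are hypothetical, matrices are stored as nested lists, and $H_k$ is assumed symmetric so that $H_k y y^T H_k = (H_k y)(H_k y)^T$. Both updates are built so that the new matrix maps $y$ to $s$, which gives a simple consistency check.

```python
def mat_vec(A, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dfp_update(H, s, y):
    """DFP inverse update (9.15); H is assumed symmetric."""
    Hy = mat_vec(H, y)
    sy = dot(s, y)
    yHy = dot(y, Hy)
    n = len(s)
    return [[H[i][j] + s[i] * s[j] / sy - Hy[i] * Hy[j] / yHy
             for j in range(n)] for i in range(n)]


def bfgs_update(H, s, y):
    """BFGS inverse update (9.16)."""
    Hy = mat_vec(H, y)
    r = [si - hi for si, hi in zip(s, Hy)]  # r = s - H y
    ys = dot(y, s)
    yr = dot(y, r)
    n = len(s)
    return [[H[i][j] + (r[i] * s[j] + s[i] * r[j]) / ys
             - yr * s[i] * s[j] / ys ** 2
             for j in range(n)] for i in range(n)]


# Sample data, chosen arbitrarily with y^T s > 0.
H0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 2.0]
y = [3.0, 1.0]
H_dfp = dfp_update(H0, s, y)
H_bfgs = bfgs_update(H0, s, y)
```

In a full quasi-Newton iteration, these functions would be called once per step with $s$ and $y$ from (9.14); the sketch leaves out the line search and any safeguard for the case $y^T s \le 0$.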

9.1.6 The Nelder-Mead Simplex (Direct Search) Method


In this section, we consider the downhill simplex method of Nelder and
Mead for finding a minimum of $\varphi(x)$, where $\varphi: D \subseteq \mathbb{R}^n \to \mathbb{R}^1$. The method
requires only function evaluations. This heuristic³ search method is practical
in many cases for a small number of variables, and is versatile, since the only
information about $\varphi$ that is required is that, if $x_1 \in \mathbb{R}^n$ and $x_2 \in \mathbb{R}^n$ are in the
domain of $\varphi$, we can determine⁴ whether or not $\varphi(x_1) > \varphi(x_2)$. Virginia
Torczon has analyzed convergence of this method and generalizations of it,
termed pattern search algorithms. The main disadvantages of the method
are slow convergence when high accuracy is required and impractically slow
performance on high-dimensional problems.


REMARK 9.9 The Nelder-Mead simplex method is used in the MATLAB 
function fminsearch. 


ALGORITHM 9.1
(The simplex method of Nelder and Mead)


INPUT:
1. the initial point $x^{**} \in \mathbb{R}^n$;


2. the heuristic parameters $\lambda$, $\alpha$, $\beta$, and $\gamma$. (Here, $\lambda$ is related to the
problem's length scale.)


3. the stopping tolerance $\epsilon$.
OUTPUT: an approximate optimizer $x_1$ and an approximate optimum $\varphi(x_1)$.
1. (Assign the initial simplex; see Remark 9.10 following this algorithm.)
(a) $x_1 \leftarrow x^{**}$.
(b) $x_{i+1} \leftarrow x_1 + \lambda e_i$ for $i = 1, 2, 3, \ldots, n$.


2. Order the points $x_1$, $x_2$, $\ldots$, $x_{n+1}$ so that


$\varphi_{n+1} \ge \varphi_n \ge \varphi_{n-1} \ge \cdots \ge \varphi_2 \ge \varphi_1$,


where $\varphi_k = \varphi(x_k)$.


³ A heuristic is a rule of thumb that is used to determine whether or not a mathematical
property is true. If the property is true according to the heuristic, then in actuality, the
property will be true in many cases, but not in all cases. Thus, a heuristic method may not
always find the answer, but often does.

⁴ This determination can be done, say, interactively. For example, if $\varphi$ corresponds to a
polynomial fit to data points, the user can be presented with a graph of the polynomial
corresponding to $x_1$ and a graph corresponding to $x_2$; then the user can tell the computer
which one is better.

3. (Construct a new point. At each iteration, a new simplex is produced
by replacing the “worst” point $x_{n+1}$, the point with the highest value
of $\varphi$.) Let

$x_c \leftarrow \dfrac{1}{n} \sum_{j=1}^{n} x_j$

be the centroid of the $n$ best vertices. In this step, we construct

$x_r \leftarrow x_c + \alpha (x_c - x_{n+1})$,

where $x_{n+1}$ is the point with the highest function value and where $\alpha$ is
a reflection coefficient.
Set $\varphi_r \leftarrow \varphi(x_r)$.


Case 1: IF $\varphi_1 \le \varphi_r \le \varphi_n$ ($x_r$ is not a new best point or a new worst
point), THEN $x^*_{n+1} \leftarrow x_r$.


Case 2: IF $\varphi_r < \varphi_1$ ($x_r$ is the new best point), then we assume that the
direction of reflection is a good direction, and we attempt to expand
the simplex in this direction. We define an expanded point by


$x_e \leftarrow x_c + \beta (x_r - x_c)$,


where $\beta > 1$ is an expansion coefficient.

IF $\varphi_e < \varphi_r$,
THEN the expansion is successful, and $x^*_{n+1} \leftarrow x_e$.
OTHERWISE, the expansion failed, and $x^*_{n+1} \leftarrow x_r$.


Case 3: IF $\varphi_r > \varphi_n$, the simplex is assumed to be too large, and should
be contracted. A contraction step is carried out, where


$x_{co} \leftarrow x_c + \gamma (x_{n+1} - x_c)$ if $\varphi_r \ge \varphi_{n+1}$,
$x_{co} \leftarrow x_c + \gamma (x_r - x_c)$ if $\varphi_r < \varphi_{n+1}$,


where $0 < \gamma < 1$ is a contraction coefficient.
IF $\varphi_{co} < \min(\varphi_r, \varphi_{n+1})$, the contraction step has succeeded, and


$x^*_{n+1} \leftarrow x_{co}$.


OTHERWISE, a further contraction is carried out. (That is, repeat
this step.)

4. IF

$\sum_{j=1}^{n+1} (\varphi_j - \mu)^2 / n < \epsilon$, where $\mu = \sum_{j=1}^{n+1} \varphi_j / (n + 1)$,


THEN continue to step 5.
OTHERWISE:


(a) $x_{n+1} \leftarrow x^*_{n+1}$;


(b) Return to step 2.


5. (If the standard deviation of the function values is smaller than a specific
tolerance $\epsilon$, then the search terminates for the starting guess $x^{**}$.)
IF $\|x^{**} - x_1\| > \epsilon$, THEN


(a) $x^{**} \leftarrow x_1$;


(b) restart the problem with one vertex of the simplex at the new minimum
point $x_1$. That is, return to step 1 after setting $x^{**} \leftarrow x_1$.
(Restarting is performed so that the criterion in step 4 is not fooled
by a single anomalous step that, for some reason, failed to move
the worst point by more than $\epsilon$.)


(c) IF however, $\|x^{**} - x_1\| \le \epsilon$, THEN


i. OUTPUT: $x_1$ and $\varphi(x_1)$.
ii. HALT.


END ALGORITHM 9.1.


REMARK 9.10 The $n + 1$ points $x_1$, $x_2$, $\ldots$, $x_{n+1}$ in Step 1 of Algorithm 9.1
define an $n$-dimensional simplex. A simplex is a geometrical figure
consisting in $n$ dimensions of $n + 1$ vertices, all interconnecting line segments,
polygonal faces, etc.; see Figure 9.2.


FIGURE 9.2: Illustration of 2- and 3-dimensional simplexes.
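
Example 9.3
Consider $\varphi(x_1, x_2) = x_1^2 + x_2^2 - 4$.


Initialization: Set $x^{**} \leftarrow (-2, 3)$, $\lambda \leftarrow 1$, $\alpha \leftarrow 1$, $\beta \leftarrow 2$, $\gamma \leftarrow 1/2$, and
$\epsilon \leftarrow 0.01$.


Step 1: $x_2 \leftarrow (-1, 3)$, $x_3 \leftarrow (-2, 4)$, $x_1 = x^{**} = (-2, 3)$.


Step 2: $\varphi(x_1) = 9$, $\varphi(x_2) = 6$, $\varphi(x_3) = 16$. Thus, set $x_1 \leftarrow (-1, 3)$, $x_2 \leftarrow
(-2, 3)$, and $x_3 \leftarrow (-2, 4)$ (reorder the points).


Step 3:

$x_c \leftarrow \frac{1}{2} \sum_{i=1}^{2} x_i = (-\frac{3}{2}, 3)$,
$x_r \leftarrow x_c + \alpha (x_c - x_3) = (-1, 2)$, $\varphi_r \leftarrow \varphi(x_r) = 1$.


Case 2: Perform an expansion, since $\varphi_r < \varphi_1$:


$x_e \leftarrow x_c + \beta (x_r - x_c) = 2x_r - x_c = (-\frac{1}{2}, 1)$.


Since $\varphi(x_e) = -\frac{11}{4} < \varphi(x_r)$, the expansion is successful, so set
$x_3^* \leftarrow x_e = (-\frac{1}{2}, 1)$.


Step 4: $\|x_3^* - x_3\|_\infty = 3 > \epsilon$. Set $x_3 \leftarrow x_3^* = (-\frac{1}{2}, 1)$, then return to step 2.

Step 2: $\varphi(x_1) = 6$, $\varphi(x_2) = 9$, and $\varphi(x_3) = -\frac{11}{4}$. Thus, reorder:


$x_1 = (-\frac{1}{2}, 1)$, $x_2 = (-1, 3)$, $x_3 = (-2, 3)$.


(This results in $\varphi_1 = -\frac{11}{4}$, $\varphi_2 = 6$, and $\varphi_3 = 9$.)


Step 3:

$x_c \leftarrow \frac{1}{2} \sum_{i=1}^{2} x_i = (-\frac{3}{4}, 2)$,
$x_r \leftarrow x_c + \alpha (x_c - x_3) = (\frac{1}{2}, 1)$, $\varphi_r \leftarrow \varphi(x_r) = -\frac{11}{4}$.

Case 1: $\varphi_1 \le \varphi_r \le \varphi_2$. Thus, set
$x_3^* \leftarrow x_r = (\frac{1}{2}, 1)$.


Step 4: $\|x_3^* - x_3\|_\infty = \frac{5}{2} > \epsilon$. Thus, set $x_3 \leftarrow x_3^* = (\frac{1}{2}, 1)$, then return to
step 2 and continue with the algorithm.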
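
The reflection, expansion, and contraction formulas of Algorithm 9.1 can be applied mechanically to the objective $\varphi(x_1, x_2) = x_1^2 + x_2^2 - 4$ and the starting simplex of this example. The Python sketch below performs steps 2 and 3 only, in exact rational arithmetic. It is a simplified illustration: the function name `nelder_mead_iteration` is hypothetical, a single contraction is performed in Case 3 without the acceptance test, and the stopping and restart logic of steps 4 and 5 is omitted.

```python
from fractions import Fraction as Fr


def phi(p):
    return p[0] ** 2 + p[1] ** 2 - 4


def nelder_mead_iteration(simplex, alpha=1, beta=2, gamma=Fr(1, 2)):
    """Order the vertices, then replace the worst one (steps 2 and 3)."""
    pts = sorted(simplex, key=phi)          # best vertex first, worst last
    best, worst = pts[0], pts[-1]
    n = len(pts) - 1
    dim = len(worst)
    # Centroid of the n best vertices.
    xc = tuple(Fr(sum(p[i] for p in pts[:-1]), n) for i in range(dim))
    # Reflection of the worst vertex through the centroid.
    xr = tuple(c + alpha * (c - w) for c, w in zip(xc, worst))
    if phi(xr) < phi(best):                 # Case 2: try an expansion
        xe = tuple(c + beta * (r - c) for c, r in zip(xc, xr))
        new = xe if phi(xe) < phi(xr) else xr
    elif phi(xr) <= phi(pts[-2]):           # Case 1: accept the reflection
        new = xr
    else:                                   # Case 3: one contraction only
        base = worst if phi(xr) >= phi(worst) else xr
        new = tuple(c + gamma * (b - c) for c, b in zip(xc, base))
    return pts[:-1] + [new]


start = [(Fr(-2), Fr(3)), (Fr(-1), Fr(3)), (Fr(-2), Fr(4))]
first = nelder_mead_iteration(start)
second = nelder_mead_iteration(first)
```

With these inputs, the first pass replaces the worst vertex by the expansion point $(-\frac{1}{2}, 1)$, and the second pass replaces it by the reflection point $(\frac{1}{2}, 1)$.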

9.1.7 Software for Unconstrained Local Optimization


General software for unconstrained local optimization typically uses combi- 
nations of steepest descent and Newton or quasi-Newton methods in sophisti- 
cated ways. Some local optimization algorithms are embedded into interactive 
systems such as MATLAB, Mathematica, or Maple. When using such systems 
for a particular problem, one should be cautious about believing that the cal- 
culated result is indeed the global optimum (to within roundoff error), even if 
the routine exits with no reported error. There is often no guarantee that, if 
the algorithm returns a point $x$ without reporting an error, then $x$ is a local
optimum (unless the problem is well-understood, such as when the problem 
is convex). However, present-day algorithms often return reasonable results 
for many problems of practical interest. 


Source code for unconstrained local optimization, that can be embedded in 
user-written software, is available from the Netlib repository, at 
http://www.netlib.org/
as well as in various commercial and proprietary packages. 


9.2. Constrained Local Optimization 


Traditionally, many of the techniques used in unconstrained optimization, 
such as line searches, descent methods, and quasi-Newton updates, can be 
used for constrained optimization, but with various complications. A treat- 
ment of classic techniques for constrained local optimization appears in [32]. 
Much of the best software for constrained local optimization is proprietary 
(such as fmincon from the MATLAB optimization toolbox for the general con- 
strained problem). 


9.3. Constrained Optimization and Nonlinear Systems 


In unconstrained optimization, a minimum of $\varphi$ occurs at a
critical point, that is, at a point where $\nabla\varphi = 0$. An analogous set of equations
for the general constrained optimization problem is derivable from the
Lagrange multiplier conditions, and is usually called the Kuhn-Tucker equations.
The Kuhn-Tucker equations for the general optimization problem (9.1)
are:

$$
F(x, u, v) =
\begin{pmatrix}
\nabla\varphi(x) + \nabla g(x)\, u + \nabla c(x)\, v \\
u_1 g_1(x) \\
\vdots \\
u_{m_2} g_{m_2}(x) \\
c_1(x) \\
\vdots \\
c_{m_1}(x)
\end{pmatrix}
= 0, \qquad (9.17)
$$


where $u \in \mathbb{R}^{m_2}$, $v \in \mathbb{R}^{m_1}$, $\nabla g$ is the matrix whose $i$-th column is $\nabla g_i$, $\nabla c$
is the matrix whose $i$-th column is $\nabla c_i$, and the condition $u \ge 0$ (where the
inequality is interpreted componentwise) must be satisfied. Points satisfying
the Kuhn-Tucker conditions (9.17) are called critical points, or Kuhn-Tucker
points of the constrained optimization problem (9.1). This is a system of
$n + m_1 + m_2$ equations in the unknown vectors $x$, $u$, and $v$. We have


THEOREM 9.1 

If the functions $\varphi$, $c$, and $g$ are sufficiently smooth and $x^*$ is a local solution
to the constrained optimization problem (9.1), then x* must satisfy the Kuhn- 
Tucker conditions, for some admissible choice of u and v. 


One place where derivation of the Kuhn—Tucker conditions and a proof of 
Theorem 9.1 can be found is [32]. 


REMARK 9.11 The Lagrange multipliers u and v are sometimes termed 
dual variables, while the original variables x are called the primal variables. 
The values of these dual variables have practical interpretations in the real- 
world problems that give rise to the optimization problem. This is especially 
true in linear programming: In linear programming, dual problems can be 
easily formulated. In such cases, the original problem is called the primal 
problem. The dual variables of the primal problem are the primal variables 
of the dual problem, and the primal variables of the original problem are the 
dual variables of the dual problem. Many optimization algorithms rely on the 
interplay between solutions of the primal and the dual. A good discussion of 
duality in linear programs appears in [29, Chapter 4], while duality is framed 
in terms of linear algebra in [32]. Practical rules for forming dual problems 
appear in [12]. 


In general, except for $u_i \ge 0$, bounds on the Lagrange multipliers $u$ and
$v$ in the Kuhn-Tucker conditions are not known, while it is advantageous in
some algorithms to know such bounds. For this and other reasons dealing
with special cases, the Kuhn-Tucker conditions are sometimes modified by
introducing an additional parameter $u_0$, replacing $\nabla\varphi$ in the first equation by
$u_0 \nabla\varphi$, and adding a normalization condition, such as


$\left( u_0 + \sum_{i=1}^{m_2} u_i + \sum_{i=1}^{m_1} v_i^2 \right) - 1 = 0$.


These new equations, where $u_i \in [0, 1]$, $0 \le i \le m_2$, and $v_i \in [-1, 1]$, are
termed the Fritz John conditions.


9.4 Linear Programming 


Linear programs are a special type of constrained optimization problem in 
which finding the global optimum is tractable. Before considering the general 
linear programming problem, a common optimization problem in business and 
industry, we consider a simple example problem. 


Example 9.4
Find $x_1, x_2 \ge 0$ such that


$4x_1 + 2x_2 \le 8$
$2x_1 + 4x_2 \le 8$


and such that

$z = 3x_1 + 2x_2$


is a maximum.

The solution of this problem is easy to find using a geometric argument.
(See Figure 9.3.) Notice that the solution space is bounded by the lines $x_1 = 0$,
$x_2 = 0$, $4x_1 + 2x_2 = 8$, and $2x_1 + 4x_2 = 8$. Consider $z = 3x_1 + 2x_2$, which
corresponds to a family of straight lines as $z$ varies. Notice that the value $z$
increases as the line is moved farther from the origin. Thus, $z$ is maximized
at $(\frac{4}{3}, \frac{4}{3})$, with corresponding maximum value $z = \frac{20}{3}$.
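
The geometric argument can be cross-checked by brute force: intersect each pair of boundary lines, discard the infeasible intersection points, and evaluate $z$ at the rest. The Python sketch below does this in exact rational arithmetic; the names `lines`, `feasible`, and `vertices` are hypothetical, and the approach is meant only for tiny two-variable examples like this one.

```python
from fractions import Fraction as Fr
from itertools import combinations

# Boundary lines a1*x1 + a2*x2 = b of the constraints
# 4*x1 + 2*x2 <= 8, 2*x1 + 4*x2 <= 8, x1 >= 0, x2 >= 0.
lines = [(4, 2, 8), (2, 4, 8), (1, 0, 0), (0, 1, 0)]


def feasible(p):
    x1, x2 = p
    return (x1 >= 0 and x2 >= 0
            and 4 * x1 + 2 * x2 <= 8 and 2 * x1 + 4 * x2 <= 8)


def vertices():
    """Feasible intersection points of pairs of boundary lines."""
    pts = []
    for (a1, a2, b), (c1, c2, d) in combinations(lines, 2):
        det = a1 * c2 - a2 * c1
        if det == 0:
            continue  # parallel lines have no intersection point
        # Cramer's rule for the 2x2 system.
        p = (Fr(b * c2 - a2 * d, det), Fr(a1 * d - b * c1, det))
        if feasible(p):
            pts.append(p)
    return pts


def z(p):
    return 3 * p[0] + 2 * p[1]


best = max(vertices(), key=z)
z_max = z(best)
```

The feasible vertices are $(0, 0)$, $(2, 0)$, $(0, 2)$, and $(\frac{4}{3}, \frac{4}{3})$, and the largest objective value is attained at the last of these.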


REMARK 9.12 This graphical approach is practical only for simple 
example problems. 


We now define the general linear programming problem. 

FIGURE 9.3: Graph of the constraint set and objective function $z$, Example 9.4.


DEFINITION 9.2 The general linear programming problem in standard
form is:


maximize

$z = c_0 + \sum_{j=1}^{n} c_j x_j$ (objective function)


subject to the linear equalities

$\sum_{j=1}^{n} a_{ij} x_j = b_i$ for $i = 1, 2, \ldots, m_1$ (9.18)


and nonnegativity constraints

$x_j \ge 0$ for $j = 1, 2, \ldots, n$.


Assumption: $b_i \ge 0$ for each $i$ and $n \ge m_1$. (If $n = m_1$, the values of $x_j$ are
uniquely determined when the matrix $A = \{a_{ij}\}$ is of full rank.)


We first consider how problems can be converted to the above standard 
form. 


REMARK 9.13 The constant $c_0$ does not affect the optimizing point $x$,
but only affects the optimum value $z$. If we ignore $c_0$, then the standard form
for the general linear programming problem can be written in matrix form as

maximize $c^T x$
subject to $Ax = b$ (9.19)
and $x \ge 0$,

where⁵ $x \in \mathbb{R}^n$, $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m_1 \times n}$, $b \in \mathbb{R}^{m_1}$, and “$\ge$” is interpreted
componentwise.


Different texts and software packages may define “standard form” somewhat 
differently than above. However, one “standard form” can be converted to 
another, as we now illustrate. 


9.4.1 Converting to Standard Form 


Case A: (Converting a minimization problem to a maximization problem) 
Suppose that we want to minimize 


Then, set z = —cg — y cjx;, and maximize z. 
j=l 


Case B: (Some b; < 0) If b; < 0, then replace the constraint 


by 


where b; = —);. 


Case C: (Replacing inequalities by equalities) Suppose that a constraint is
$$\sum_{j=1}^{n} a_{ij} x_j \le b_i.$$
Then, we introduce a new "slack" variable $s_i \ge 0$ so that
$$\sum_{j=1}^{n} a_{ij} x_j + s_i = b_i.$$


$^5$ The vector $c$ here has a different meaning from the vector function $c$ used in the general
form (9.1) on page 487.

Case D: (Some variables are not constrained to be nonnegative) Suppose
that $x_j$ is not constrained to be nonnegative. We introduce $x_j^+ \ge 0$ and
$x_j^- \ge 0$ such that $x_j = x_j^+ - x_j^-$.


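Example 9.5
Minimize $-x_1 + x_2 - x_3$ subject to
$$\begin{array}{rcl} x_1 - 3x_2 + 4x_3 &=& 5, \\ x_1 - 2x_2 &\le& 3, \\ 2x_2 - x_3 &\ge& 4, \\ x_1 &\ge& 0, \\ x_2 &\ge& 0. \end{array}$$

We convert this to standard form:
Maximize $z = x_1 - x_2 + x_3^+ - x_3^-$ subject to
$$\begin{array}{rcl} x_1 - 3x_2 + 4x_3^+ - 4x_3^- &=& 5, \\ x_1 - 2x_2 + x_4 &=& 3, \\ 2x_2 - x_3^+ + x_3^- - x_5 &=& 4, \end{array}$$
where $x_1$, $x_2$, $x_3^+$, $x_3^-$, $x_4$, and $x_5$ are all nonnegative. Note that $n = 6$ and
$m_1 = 3$ in the standard form problem.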
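The conversion in Example 9.5 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `to_standard` and `residuals` and the test point $(6, 3, 2)$ are illustrative assumptions. It maps a feasible point of the original problem to the six standard-form variables (Cases C and D) and confirms that the three equalities hold and that $z$ is the negative of the original objective (Case A).

```python
def to_standard(x1, x2, x3):
    """Map a point of the original problem of Example 9.5 to the
    standard-form variables (x1, x2, x3+, x3-, x4, x5)."""
    x3_plus, x3_minus = max(x3, 0), max(-x3, 0)   # Case D: x3 = x3+ - x3-
    x4 = 3 - (x1 - 2 * x2)                        # Case C: slack for "<= 3"
    x5 = (2 * x2 - x3) - 4                        # Case C: surplus for ">= 4"
    return x1, x2, x3_plus, x3_minus, x4, x5


def residuals(v):
    """Residuals of the three standard-form equality constraints."""
    x1, x2, x3p, x3m, x4, x5 = v
    return (x1 - 3 * x2 + 4 * x3p - 4 * x3m - 5,
            x1 - 2 * x2 + x4 - 3,
            2 * x2 - x3p + x3m - x5 - 4)


# Hypothetical feasible point of the original problem (chosen for illustration).
point = (6, 3, 2)
std = to_standard(*point)
z = std[0] - std[1] + std[2] - std[3]             # standard-form objective
original_objective = -point[0] + point[1] - point[2]
print(std, residuals(std), z, original_objective)
```

The point is feasible for the original problem exactly when all six standard-form variables come out nonnegative and the residuals vanish.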


We now present a theorem that underlies a common method for finding 
solutions to linear programming problems. 


9.4.2 The Fundamental Theorem of Linear Programming 


THEOREM 9.2

An optimum solution to the linear programming problem (9.18) occurs at a
point for which at most $m_1$ variables are positive and the remaining $n - m_1$
variables are zero. (The usual case is that exactly $m_1$ variables are nonzero.)


We give the proof of this theorem later, after we describe the simplex
method for approximately solving (9.18).


DEFINITION 9.3 A set of $m_1$ variables is called a basis. Points $x \in \mathbb{R}^n$
with $n - m_1$ components of $x$ equal to zero and which satisfy the constraints
$Ax = b$ of (9.19) are called basic feasible points or basic feasible solutions.
Thus, a basic feasible solution consists of a nonnegative solution of $m_1$ variables
to the $m_1$ linear equalities, with the remaining $n - m_1$ variables set to
zero. (In general, a point satisfying all of the constraints of an optimization
problem is called a feasible point, or feasible solution.$^6$)


Theorem 9.2 suggests that one way of finding the optimum solution is by
using a trial-and-error approach: select $m_1$ variables, set the other
variables to zero, and solve the resulting $m_1$ equations in $m_1$ variables. However,
there are $n!/((n - m_1)!\,m_1!)$ ways of selecting $m_1$ variables from $n$ variables;
that is, there are up to $n!/((n - m_1)!\,m_1!)$ basic feasible points. (For example,
if $m_1 = 6$ and $n = 15$, relatively small values, then there are $15!/(6!\,9!) = 5005$
possibilities. However, linear programming problems are solved today with
$m_1$ and $n$ in the hundreds of thousands, or more; a search of all possibilities
would require an astronomical amount of time, even with today's most
advanced computers.)
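The growth of this count is easy to tabulate. The short sketch below uses Python's standard `math.comb`; the helper name `basis_count` and the larger sample sizes are illustrative assumptions.

```python
import math


def basis_count(n, m1):
    """Number of ways to choose m1 basic variables out of n."""
    return math.comb(n, m1)


print(basis_count(15, 6))                  # the example in the text
print(len(str(basis_count(2000, 1000))))   # number of decimal digits for a modest problem
```

Even for a problem with only a few thousand variables, the count has hundreds of digits, which is why exhaustive enumeration is not an option.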

We now consider an efficient method of searching the basic feasible points, 
the well-known simplex method for linear programming. 


9.4.3 The Simplex Method 


The simplex method for finding solutions of (9.18) was invented in 1947 
by George Dantzig. In the early days of computing (through the 1960’s) 
it has been said that over half the time used on computer processors was 
spent performing the simplex method. The simplex method and its variants 
(including variants for structured problems, sparse constraint matrices A, etc.) 
are still extremely important. 


REMARK 9.14 In general, the simplex method does not execute in
polynomial time. That is, there is a sequence of problems with increasing $n$
and $m_1$ such that the amount of work for the simplex method to complete the
problem increases at a rate greater than $O(n^k + m_1^k)$, for any integer $k$. In
contrast, interior point methods, originating from Karmarkar's algorithm (see
[42]) can be shown to execute in polynomial time, and, when implemented
well, are practical for extremely large problems. Although thorough coverage
of interior point methods is outside the scope of this book, a good introduction
and overview can be found in [102]. In any case, variants of the simplex
method are still highly practical for many problems, some with quite large
$n$ and $m_1$, and some state-of-the-art software systems continue to use the
simplex method.


In the simplex method, basic feasible solutions are tested one-by-one while 
steadily improving the objective function value, until the optimal value is 


6A feasible solution is not to be confused with a solution to the optimization problem: A 
feasible solution is simply a solution of the system of constraints, without reference to the 
objective function. 
reached. In fact, the simplex method can be viewed as a kind of steepest
ascent method, proceeding from one extreme point of the feasible region to 
another, choosing to follow that edge of the simplex in which the objective is 
increasing most rapidly. We now present the simplex method informally. 


Step 1: Find an initial basic feasible solution. One sure way to find an initial
basic feasible solution is to introduce $m_1$ additional variables $x_{n+1}$, $x_{n+2}$,
$\ldots$, $x_{n+m_1}$ such that the linear equalities have the form
$$\begin{array}{rcl} a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n + x_{n+1} &=& b_1, \\ a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n + x_{n+2} &=& b_2, \\ &\vdots& \\ a_{m_1,1} x_1 + a_{m_1,2} x_2 + \cdots + a_{m_1,n} x_n + x_{n+m_1} &=& b_{m_1}. \end{array}$$
Then, an initial basic feasible solution is
$$x_{n+1} = b_1,\ x_{n+2} = b_2,\ \ldots,\ x_{n+m_1} = b_{m_1},$$
with the other variables equal to zero. (In the simplex method, $x_{n+1}$,
$x_{n+2}, \ldots, x_{n+m_1}$ will eventually be set equal to zero.)


Step 2: Suppose that $x_1, \ldots, x_{m_1}$ is a basic feasible solution; using elementary
row operations (see Definition 3.12 on page 88), the linear system
and objective function are put into the form
$$\begin{array}{rcl} x_1 + a'_{1,m_1+1} x_{m_1+1} + \cdots + a'_{1,n} x_n &=& b'_1 \\ x_2 + a'_{2,m_1+1} x_{m_1+1} + \cdots + a'_{2,n} x_n &=& b'_2 \\ &\vdots& \\ x_{m_1} + a'_{m_1,m_1+1} x_{m_1+1} + \cdots + a'_{m_1,n} x_n &=& b'_{m_1} \\ \text{and} \quad c'_{m_1+1} x_{m_1+1} + \cdots + c'_n x_n &=& z - c'_0 \end{array}$$
and the basic feasible solution is $x_1 = b'_1$, $x_2 = b'_2$, $\ldots$, $x_{m_1} = b'_{m_1}$, with
the other variables equal to zero.


Step 3: If $c'_j \le 0$ for $j = m_1 + 1, \ldots, n$, then the basic feasible solution is
optimal, since $z$ cannot be increased by making $x_{m_1+1}, \ldots, x_n$ positive. Thus, the
procedure is finished.


Step 4: If the $c'_j$ are not all $\le 0$, then choose
$$c'_s = \max_{m_1 + 1 \le j \le n} c'_j.$$
This choice increases $z$ the most for a unit change in $x_s$.


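Step 5: The variable $x_s$ will replace one of the variables $x_1$ to $x_{m_1}$. (In
effect, we set one of $x_1$ through $x_{m_1}$ to zero, and make $x_s$ nonzero.) To
decide which variable to remove, note that, with the other nonbasic variables
held at zero, each equation has the form
$x_i + a'_{i,s} x_s = b'_i$. Thus, since $x_i \ge 0$, $x_s$ is limited in magnitude by
$$x_s \le \frac{b'_i}{a'_{i,s}} \quad \text{for each } i \text{ with } a'_{i,s} > 0.$$
Let
$$\frac{b'_r}{a'_{r,s}} = \min_{1 \le i \le m_1} \frac{b'_i}{a'_{i,s}}$$
for $a'_{i,s} > 0$. Then $x_r$ is the variable to be removed.


(If $a'_{i,s} \le 0$ for every $i$, then the optimal solution is unbounded. We give
an example of this below.)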


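Step 6: Replace the form in Step 2 with
$$\begin{array}{rcl} x_1 + a^*_{1,r} x_r + a^*_{1,m_1+1} x_{m_1+1} + \cdots + 0 + \cdots + a^*_{1,n} x_n &=& b^*_1 \\ x_2 + a^*_{2,r} x_r + a^*_{2,m_1+1} x_{m_1+1} + \cdots + 0 + \cdots + a^*_{2,n} x_n &=& b^*_2 \\ &\vdots& \\ a^*_{r,r} x_r + a^*_{r,m_1+1} x_{m_1+1} + \cdots + x_s + \cdots + a^*_{r,n} x_n &=& b^*_r \\ &\vdots& \\ x_{m_1} + a^*_{m_1,r} x_r + a^*_{m_1,m_1+1} x_{m_1+1} + \cdots + 0 + \cdots + a^*_{m_1,n} x_n &=& b^*_{m_1} \\ c^*_r x_r + c^*_{m_1+1} x_{m_1+1} + \cdots + 0 + \cdots + c^*_n x_n &=& z - c^*_0 \end{array}$$
where the entries $0$ and $x_s$ occupy the column of $x_s$.

That is, using $a'_{r,s}$ as a pivot element, we subtract multiples of the $r$-th
row of the system of equations in Step 2 to eliminate $a'_{i,s}$ from the $i$-th
row, $1 \le i \le m_1 + 1$, $i \ne r$. Notice that $c^*_0 = c'_0 + b'_r c'_s / a'_{r,s}$, so $z$ is
larger.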
Example 9.6
Maximize $2x_1 + x_2 = z$ subject to
$$\begin{array}{rcl} x_1 + 2x_2 &\le& 10, \\ x_1 + x_2 &\le& 6, \\ x_1 - x_2 &\le& 2, \\ x_1 - 2x_2 &\le& 1, \end{array}$$
where $x_1 \ge 0$, $x_2 \ge 0$. First we put this problem into the following standard
form:
$$\begin{array}{rcl} x_1 + 2x_2 + x_3 &=& 10 \\ x_1 + x_2 + x_4 &=& 6 \\ x_1 - x_2 + x_5 &=& 2 \\ x_1 - 2x_2 + x_6 &=& 1 \\ 2x_1 + x_2 &=& z \end{array}$$

1. An initial basic feasible set is $x_3 = 10$, $x_4 = 6$, $x_5 = 2$, $x_6 = 1$, $x_1 = 0$,
$x_2 = 0$, and $z = 0$.


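2. Choose $x_s = x_1$ and $x_r = x_6$ (add $x_1$ and remove $x_6$). The new system
has the form:
$$\begin{array}{rcl} 4x_2 + x_3 - x_6 &=& 9 \\ 3x_2 + x_4 - x_6 &=& 5 \\ x_2 + x_5 - x_6 &=& 1 \\ x_1 - 2x_2 + x_6 &=& 1 \\ 5x_2 - 2x_6 &=& z - 2 \end{array}$$
The new basic feasible set is $x_1 = 1$, $x_3 = 9$, $x_4 = 5$, $x_5 = 1$, $x_2 = 0$,
$x_6 = 0$, and $z = 2$.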


3. Choose $x_s = x_2$ and $x_r = x_5$. The system then has the form:
$$\begin{array}{rcl} x_3 - 4x_5 + 3x_6 &=& 5 \\ x_4 - 3x_5 + 2x_6 &=& 2 \\ x_2 + x_5 - x_6 &=& 1 \\ x_1 + 2x_5 - x_6 &=& 3 \\ -5x_5 + 3x_6 &=& z - 7 \end{array}$$
The new feasible set is $x_1 = 3$, $x_2 = 1$, $x_3 = 5$, $x_4 = 2$, $x_5 = 0$, $x_6 = 0$,
and $z = 7$.


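4. Choose $x_s = x_6$ and $x_r = x_4$. The resulting system then has the form:
$$\begin{array}{rcl} x_3 - \tfrac{3}{2} x_4 + \tfrac{1}{2} x_5 &=& 2 \\ \tfrac{1}{2} x_4 - \tfrac{3}{2} x_5 + x_6 &=& 1 \\ x_2 + \tfrac{1}{2} x_4 - \tfrac{1}{2} x_5 &=& 2 \\ x_1 + \tfrac{1}{2} x_4 + \tfrac{1}{2} x_5 &=& 4 \\ -\tfrac{3}{2} x_4 - \tfrac{1}{2} x_5 &=& z - 10 \end{array}$$
The new basic feasible set is $x_1 = 4$, $x_2 = 2$, $x_3 = 2$, $x_6 = 1$, and $z = 10$.
This is the optimum value.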


Notice that by brute force (trial-and-error), $\frac{6!}{4!\,2!} = 15$ linear systems would
require testing. The simplex method only required 4.
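Steps 1 through 6 can be sketched in code for the special case in which every constraint is of the form $\sum_j a_{ij} x_j \le b_i$ with $b_i \ge 0$, so that the slack variables supply the initial basis. This is an illustrative dense-tableau sketch under those assumptions, not a production solver: the function name `simplex_max` is hypothetical, exact rational arithmetic is used to avoid roundoff, and there is no safeguard against cycling.

```python
from fractions import Fraction


def simplex_max(A, b, c):
    """Maximize c.x subject to A x <= b, x >= 0, assuming every b[i] >= 0.

    Returns (x, z) with exact rational entries; raises ValueError if the
    objective is unbounded.  Slack variables form the initial basis.
    """
    m, n = len(A), len(c)
    # Constraint rows: [A | I | b].  Last row: objective coefficients, with
    # the right-hand side holding -c0 (so the current z is minus that entry).
    T = [[Fraction(v) for v in A[i]]
         + [Fraction(int(i == k)) for k in range(m)]
         + [Fraction(b[i])] for i in range(m)]
    T.append([Fraction(v) for v in c] + [Fraction(0)] * (m + 1))
    basis = list(range(n, n + m))
    while True:
        # Step 4: entering variable = largest objective coefficient.
        s = max(range(n + m), key=lambda j: T[m][j])
        if T[m][s] <= 0:
            break                                  # Step 3: optimal
        rows = [i for i in range(m) if T[i][s] > 0]
        if not rows:
            raise ValueError("objective is unbounded")
        # Step 5: ratio test picks the leaving variable.
        r = min(rows, key=lambda i: T[i][-1] / T[i][s])
        # Step 6: pivot on T[r][s].
        pivot = T[r][s]
        T[r] = [v / pivot for v in T[r]]
        for i in range(m + 1):
            if i != r and T[i][s] != 0:
                f = T[i][s]
                T[i] = [vi - f * vr for vi, vr in zip(T[i], T[r])]
        basis[r] = s
    x = [Fraction(0)] * (n + m)
    for i, j in enumerate(basis):
        x[j] = T[i][-1]
    return x[:n], -T[m][-1]


# Example 9.6
x_opt, z_opt = simplex_max([[1, 2], [1, 1], [1, -1], [1, -2]],
                           [10, 6, 2, 1], [2, 1])
print(x_opt, z_opt)
```

On Example 9.6 the sketch performs the same three pivots as the worked solution (entering $x_1$, $x_2$, $x_6$ and removing $x_6$, $x_5$, $x_4$ in turn).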
9.4.4 Proof of the Fundamental Theorem of Linear Programming


PROOF The equality constraints in (9.18) can be written
$$v_1 x_1 + v_2 x_2 + \cdots + v_n x_n = b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{m_1} \end{pmatrix},$$
where $x_i \ge 0$ for $i = 1, \ldots, n$, $\sum_{i=1}^{n} c_i x_i = z - c_0$, and $v_j = A_{:,j}$, $j = 1, \ldots, n$
are $m_1$-dimensional column vectors.
Assume that we have an optimal solution in which $r$ variables, say the
first $r$, are positive and $n - r$ variables are zero. If $r \le m_1$, then the theorem
is true, so let us assume $r > m_1$. Let $(x_1, x_2, \ldots, x_r, x_{r+1}, \ldots, x_n) =
(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_r, 0, 0, \ldots, 0)$ be this solution. This means that
$$v_1 \hat{x}_1 + v_2 \hat{x}_2 + \cdots + v_r \hat{x}_r = b, \quad \hat{x}_1 > 0,\ \hat{x}_2 > 0,\ \ldots,\ \hat{x}_r > 0,$$
and
$$c_1 \hat{x}_1 + c_2 \hat{x}_2 + \cdots + c_r \hat{x}_r = \hat{z}.$$

Since $r > m_1$, the vectors $v_1, v_2, \ldots, v_r$ must be linearly dependent. It follows
that there is a set of numbers $y_1, \ldots, y_r$, not all zero, such that $v_1 y_1 + v_2 y_2 + \cdots + v_r y_r = 0$,
and we may assume that at least one $y_j > 0$ (otherwise we can multiply by
$-1$).

Now, set $t = \max_j (y_j / \hat{x}_j)$, a positive number. Then, we can show that
$$v_1 \left( \hat{x}_1 - \frac{y_1}{t} \right) + v_2 \left( \hat{x}_2 - \frac{y_2}{t} \right) + \cdots + v_r \left( \hat{x}_r - \frac{y_r}{t} \right) = b.$$

That is, $\hat{x}_j - y_j/t$ is also a solution. In addition, $\hat{x}_j \ge y_j/t$ implies $\hat{x}_j - y_j/t \ge 0$.
Moreover, by the definition of $t$, at least one of these is zero. Thus, we have
a feasible solution in which fewer than $r$ are positive.

We now need to show that the solution is also optimal, i.e., that
$$c_1 \left( \hat{x}_1 - \frac{y_1}{t} \right) + \cdots + c_r \left( \hat{x}_r - \frac{y_r}{t} \right) = c_1 \hat{x}_1 + c_2 \hat{x}_2 + \cdots + c_r \hat{x}_r = \hat{z},$$
which is true if $c_1 y_1 + \cdots + c_r y_r = 0$. If this were not true, we could find $u$
such that
$$u (c_1 y_1 + \cdots + c_r y_r) = c_1 (u y_1) + c_2 (u y_2) + \cdots + c_r (u y_r) > 0.$$
Adding $\sum_{j=1}^{r} c_j \hat{x}_j$ to both sides, we obtain
$$c_1 (\hat{x}_1 + u y_1) + c_2 (\hat{x}_2 + u y_2) + \cdots + c_r (\hat{x}_r + u y_r) > c_1 \hat{x}_1 + c_2 \hat{x}_2 + \cdots + c_r \hat{x}_r = \hat{z}.$$
But $\hat{x}_j + u y_j$, $1 \le j \le r$, is easily shown to be a solution for any $u$. By
making $|u|$ sufficiently small, it would be a nonnegative solution. But this
would mean that $\hat{x}_j + u y_j$ would give a larger value of $z$ than the $\hat{x}_j$, which
is a contradiction.


We have thus proved that the number of positive variables in an optimal
solution can be reduced if $r > m_1$. Thus, we are led to a solution in which at
most $m_1$ variables are positive.


9.4.5 Degenerate Cases 


Although the Fundamental Theorem of Linear Programming states that a
maximum value of the objective function $-\varphi(x) = c_0 + \sum_{j=1}^{n} c_j x_j$ occurs at
a point for which at most $m_1$ of the $x_j$ are nonzero, it does not rule out other
points at which $-\varphi(x)$ is maximum. That is, there may be more than one
point at which $-\varphi$ attains its (unique) maximum value.


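Example 9.7

Minimize $10A + 3.5B + 4C + 3.2D$
subject to:
$$\begin{array}{rcl} 100A + 50B + 80C + 40D &\le& 200{,}000, \\ 12A + 4B + 4.8C + 4D &\ge& 18{,}000, \\ 0 \le 100A &\le& 100{,}000, \\ 0 \le 50B &\le& 100{,}000, \\ 0 \le 80C &\le& 100{,}000, \\ 0 \le 40D &\le& 100{,}000. \end{array}$$

Any point along the portion of the line $B = 0$, $D = 2500$, $A$ and $C$ given
parametrically by
$$\begin{pmatrix} A \\ C \end{pmatrix} = \begin{pmatrix} 666.6 \\ 0 \end{pmatrix} + t \begin{pmatrix} -0.3714 \\ 0.9285 \end{pmatrix},$$
with $0 \le C \le 1250$, is a solution to this problem.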


Such degenerate problems are relatively common in practice, and occur 
when the vectors formed from the coefficients of the objective function and the 
vectors formed from the coefficients of the constraints are linearly dependent. 
Some software for solving linear programs removes some linear dependence by 
preprocessing, prior to applying the simplex method or interior point method, 
but linear dependence such as in Example 9.7 is intrinsic to the problem. Many 
software systems will return the optimum value for $-\varphi$ and an optimizing
point that happens to be at a vertex, without indicating that this optimizing 
point is not unique. 
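The linear dependence behind Example 9.7 can be checked directly. The sketch below is an illustration under one stated assumption: the coefficient of $C$ in the second constraint is read as 4.8. It verifies that, restricted to the variables $A$ and $C$, the objective coefficients $(10, 4)$ are parallel to the constraint coefficients $(12, 4.8)$, and that the quoted direction $(-0.3714, 0.9285)$ changes neither the objective nor the left side of that constraint.

```python
from fractions import Fraction

# Coefficients of A and C only (B = 0 and D = 2500 are held fixed).
objective = (Fraction(10), Fraction(4))
constraint = (Fraction(12), Fraction("4.8"))      # assumed reading of the garbled 4.8C
direction = (Fraction("-0.3714"), Fraction("0.9285"))


def dot(u, v):
    """Exact dot product of two 2-vectors."""
    return u[0] * v[0] + u[1] * v[1]


# A zero 2x2 determinant means the two coefficient vectors are parallel.
determinant = objective[0] * constraint[1] - objective[1] * constraint[0]
print(determinant, dot(objective, direction), dot(constraint, direction))
```

All three printed quantities are zero, so moving along the direction keeps the second constraint active while leaving the objective unchanged, which is the source of the non-unique optimizer.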
9.4.6 Failure of Linear Programming Solvers


Linear programming solvers can fail, due both to intrinsic properties of the
problem posed to them and to numerical issues, such as roundoff
error or efficiency considerations. There are two types of problems whose
formulations do not admit solutions:


• unbounded problems, and


• infeasible problems.




FIGURE 9.4: Graph of the constraint set and the objective function z for 
Example 9.8, an unbounded LP. 


Example 9.8 


(Illustration of an unbounded linear program) Consider maximizing $x_1 + 2x_2$
subject to
$$-x_1 + x_2 \le 1, \quad x_1 - 2x_2 \le 1, \quad x_1 \ge 0, \quad x_2 \ge 0.$$


The feasible set is the shaded region in Figure 9.4, while a level curve of 
the objective is given by the dashed line. Observe that one can increase the 
objective without bound by choosing only points within the feasible set. 
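One way to see the unboundedness concretely is to walk along a ray that stays inside the feasible set. The sketch below assumes the constraints of Example 9.8 read $-x_1 + x_2 \le 1$ and $x_1 - 2x_2 \le 1$ with $x_1, x_2 \ge 0$; the ray $(t, t)$ and the sample values of $t$ are illustrative choices.

```python
def feasible(x1, x2):
    """Constraints of Example 9.8 (as read from the text)."""
    return -x1 + x2 <= 1 and x1 - 2 * x2 <= 1 and x1 >= 0 and x2 >= 0


def objective(x1, x2):
    return x1 + 2 * x2


# Points (t, t) remain feasible for every t >= 0, and the objective is 3t.
samples = [1, 10, 100, 1000]
values = [objective(t, t) for t in samples if feasible(t, t)]
print(values)
```

Every sample point is feasible and the objective grows linearly in $t$, so no finite maximum exists.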
REMARK 9.15 The feasible set must be unbounded for a linear program
to be unbounded,$^7$ but the converse is not true. In particular if, in
Example 9.8, the objective were to minimize $x_1 + 2x_2$ rather than maximize
$x_1 + 2x_2$, then the problem would have a solution at $x_1 = 0$, $x_2 = 0$, and the
simplex method would find it.


Example 9.9 
(Example of an infeasible linear program) Consider maximizing x1 + x2 sub- 
ject to 
tel, 24,-x<-4, 220, we > 0. 


Then there are no points that satisfy all of the constraints, so the feasible set 
is empty. 


REMARK 9.16 The notions of unbounded problems and infeasible prob- 
lems carry over to the general nonlinear optimization problem (9.1), but the 
ways that such problems can occur are more complicated in the general case. 


One reference with examples and references to problems that can occur due 
to roundoff error in linear programming solvers is [63]. 


9.4.7 Software for Linear Programming 


Due to the commercial value of linear programming software, many of the 
most efficient and reliable linear programming packages are at present pro- 
prietary, available for licensing fees. Some of these are in add-on packages 
to interactive systems, such as linprog in MATLAB’s optimization toolbox. 
However, there exist various reasonably competitive free packages, such as 
CLP from the COmputational INfrastructure for Operations Research (COIN-OR)
project (see http://www.coin-or.org/).


9.4.8 Quadratic Programming and Convex Programming 


Quadratic programs are instances of the general optimization problem (9.1)
in which the objective $\varphi$ is quadratic and the constraints $c$ and $g$ are linear.
That is, $\varphi$ is of the form
$$\varphi(x) = \frac{1}{2} x^T H x + h^T x + d, \tag{9.20}$$


$^7$ One way of seeing this is to remember the general theorem that says that a continuous
function over a compact set must attain its maximum and minimum on that set.

for some symmetric matrix $H$. If the Hessian matrix $H$ of $\varphi$ is positive
definite, then the quadratic programming problem is tractable; the problem
can be posed in terms of a linear program.

There are special algorithms for quadratic programming, such as quadprog 
from the MATLAB optimization toolbox. 

Increasingly, software is also becoming available for convex programs. These
are instances of (9.1) in which the function $\varphi$ and the functions $g_i$ are convex,
and there are either no equality constraints, or else the equality constraints $c_i$
are linear.


9.5 Dynamic Programming 


Dynamic programming is a technique for solving a special class of optimiza- 
tion problems called multi-stage decision processes [54, 65]. It is difficult to 
present a specific mathematical form for the class of optimization problems 
which dynamic programming can solve. Dynamic programming is a compu- 
tational technique that generally is used to reduce a difficult problem in n 
variables into a series of optimization problems in one variable through ap- 
plication of recurrence relations. (The whole method might well have been 
named recurrence optimization.) The possibility of applying dynamic pro- 
gramming depends on a successful formulation of the problem in terms of a 
multi-stage decision process. Two examples are presented here that illustrate 
the technique. 


Example 9.10 

Suppose that you own a lake and each year you either fish and remove 70% 
of the fish, or you do not fish. If you fish and remove 70% of the fish, the 
fish that were not caught reproduce, replenishing the original population. If 
you don’t fish, the fish population doubles in size the next year. The initial 
fish population is 10 tons. Your profit is $1000/ton and the interest rate is 
constant at 25%. You have 3 seasons to fish. What procedure will optimize 
your profit? 

A schematic diagram of the fish population is given in Figure 9.5. A
schematic diagram of present values of profit in $1000's is given in Figure 9.6.
(Recall that the present value at time 0 is equal to the amount of money at the
$n$-th year divided by $(1 + r)^n$, where $r$ is the interest rate; here $r = 0.25$.)
The optimum profit at each node is obtained
by working backward. The optimum strategy is indicated by the darkened
path.
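The backward recursion for Example 9.10 can be sketched as a short recursive computation. The timing convention is an assumption: decisions are made in years 0, 1, and 2, and income received in year $t$ is divided by $1.25^t$. This convention is chosen because it reproduces the value 20.16 that appears in Figure 9.6. The function name `best` and the labels "fish" and "wait" are illustrative.

```python
from fractions import Fraction

CATCH = Fraction(7, 10)      # fraction of the population removed when fishing
DISCOUNT = Fraction(4, 5)    # 1 / (1 + 0.25)
SEASONS = 3


def best(t, population):
    """Return (present value in $1000's, plan) from season t onward."""
    if t == SEASONS:
        return Fraction(0), ()
    # Fish: sell 70% now; the survivors replenish the population.
    v_fish, plan_fish = best(t + 1, population)
    v_fish += DISCOUNT ** t * CATCH * population   # $1000 per ton
    # Wait: no income this season; the population doubles.
    v_wait, plan_wait = best(t + 1, 2 * population)
    if v_fish >= v_wait:
        return v_fish, ("fish",) + plan_fish
    return v_wait, ("wait",) + plan_wait


value, plan = best(0, Fraction(10))
print(float(value), plan)
```

Under these assumptions the best plan is to let the population double in the first season and then fish in the remaining two.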


Dynamic programming solves the problem step-by-step, starting at the ter- 
mination time and working back to the beginning. For this reason, dynamic 
[Figure 9.5 annotation: numbers underneath branches refer to tons of fish that
were fished; the columns are labeled year 0, year 1, year 2, and year 3.]


FIGURE 9.5: Diagram of the fish population for Example 9.10 of a mul- 
tistage decision process. 


programming is sometimes characterized by the phrase "it solves the problem
backward." The next example will make this clearer.


Example 9.11 
Suppose that you own a gold mine. You wish to operate the mine for ten
years. The cost each year to extract $z$ ounces of gold is $500 z^2 / x$ dollars, where $x$ is
the number of ounces remaining at the beginning of the year. Assume that
the price of gold is $g = 400$ dollars per ounce, the interest rate is 10%, and $x_0 = 50{,}000$
ounces. What is the maximum amount of profit?

Let

$V_j$ = {value of the mine from the $j$-th year to the last year, adjusted
according to the 10% interest rate to be in dollars at the beginning
of the 10 years},
$x_j$ = {number of ounces of gold remaining at the beginning of the $j$-th year},
$g$ = {the price of gold (assumed to be constant)},
$z_j$ = {number of ounces of gold extracted in the $j$-th year}, and
$d$ = {the discount factor} $= 1/1.1$.


We consider first the last year. Assuming no gold is left at the end of the
10-th year, we have
$$V_9 = \max_{0 \le z_9 \le x_9} \left\{ g z_9 - 500 z_9^2 / x_9 \right\} = \frac{g^2}{2000} x_9 = k_9 x_9,$$
[Figure 9.6 annotation: notice that the optimum profit at each node is obtained
by working backwards; a node value of 20.16 is shown, and the columns are
labeled year 0, year 1, year 2, and year 3.]


FIGURE 9.6: Profit diagram for Example 9.10 of a multistage decision 
process. 


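where $k_9 = g^2/2000$. Now, considering the previous year, we have $x_9 = x_8 - z_8$,
and
$$\begin{array}{rcl} V_8 &=& \displaystyle \max_{0 \le z_8 \le x_8} \left\{ g z_8 - 500 z_8^2 / x_8 + d V_9 \right\} \\ &=& \displaystyle \max_{0 \le z_8 \le x_8} \left\{ g z_8 - 500 z_8^2 / x_8 + d k_9 (x_8 - z_8) \right\} \\ &=& \displaystyle \left[ \frac{(g - d k_9)^2}{2000} + d k_9 \right] x_8 \\ &=& k_8 x_8. \end{array}$$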


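In general, for the $j$-th year, we have $x_{j+1} = x_j - z_j$, and
$$\begin{array}{rcl} V_j &=& \displaystyle \max_{0 \le z_j \le x_j} \left\{ g z_j - 500 z_j^2 / x_j + d V_{j+1} \right\} \\ &=& \displaystyle \max_{0 \le z_j \le x_j} \left\{ g z_j - 500 z_j^2 / x_j + d k_{j+1} (x_j - z_j) \right\} \\ &=& \displaystyle \left[ \frac{(g - d k_{j+1})^2}{2000} + d k_{j+1} \right] x_j, \end{array}$$
that is,
$$V_j = k_j x_j \quad \text{where} \quad k_j = \frac{(g - d k_{j+1})^2}{2000} + d k_{j+1}.$$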


Thus, $V_0 = k_0 x_0$ is the optimal value of the mine, and $k_0$ can be determined
by working backwards. Indeed, we have the following table.

j | k_j
9 | 80.00
8 | 126.28
7 | 155.47
6 | 174.79
5 | 187.96
4 | 197.13
3 | 203.58
2 | 208.17
1 | 211.45
0 | 213.81


Hence, $V_0 \approx 213.81 \times 50{,}000 \approx 10{,}691{,}000$ dollars. Also, we may find the optimum
mining strategy. In particular, $z_j$ is the maximizer in the $j$-th subproblem,
namely,
$$z_j = \frac{g - d k_{j+1}}{1000}\, x_j,$$
so
$$x_{j+1} = x_j - z_j = \left[ 1 - \frac{g - d k_{j+1}}{1000} \right] x_j,$$
with $x_0 = 50000$.
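The backward recurrence for $k_j$ and the forward recurrence for the extraction plan can be sketched as follows. The function name `mine_value` and its parameter names are illustrative; the convention $k_{10} = 0$ (no value after the last year) is an assumption that reproduces $k_9 = g^2/2000$.

```python
def mine_value(g=400.0, rate=0.10, x0=50_000.0, years=10, cost=500.0):
    """Backward pass for k_j, then forward pass for the extraction plan z_j."""
    d = 1.0 / (1.0 + rate)                 # discount factor
    k = [0.0] * (years + 1)                # k[years] = 0: nothing after the last year
    for j in range(years - 1, -1, -1):     # work backwards from the last year
        dk = d * k[j + 1]
        k[j] = (g - dk) ** 2 / (4.0 * cost) + dk
    plan, x = [], x0
    for j in range(years):                 # optimal extraction, year by year
        z = (g - d * k[j + 1]) / (2.0 * cost) * x
        plan.append(z)
        x -= z
    return k[:years], k[0] * x0, plan


k, v0, plan = mine_value()
for j in range(9, -1, -1):
    print(j, round(k[j], 2))
print(round(v0), round(plan[0], 1))
```

The printed values of $k_j$ agree with the table to two decimal places, and the product $k_0 x_0$ rounds to the figure quoted in the text.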


9.6 Global (Nonconvex) Optimization 


Global optimization, in contrast to local optimization, is the process of
finding the minimum of $\varphi$ in problem (9.1) over the entire feasible set, if such
a minimum exists. As discussed in the introduction to this chapter, there are
no known algorithms that are guaranteed to find approximate solutions to all
global optimization problems with $n$ variables in $O(n^k)$ time, for any integer
$k$. In fact, present algorithms that are guaranteed to find a global optimum
for the general case$^8$ of problem (9.1) subdivide the region in a technique akin
to adaptive quadrature.

Nonetheless, global optimization problems are important in applications, 
and research on efficient algorithms has exploded during the past several 
decades. Such algorithms can be classified in various ways. One classification 
is as follows. 


$^8$ That is, without further assumptions on $\varphi$, $c$, and $g$, other than perhaps differentiability.

deterministic algorithms: These are algorithms that are certain to find the
global optimum and global optimizers to within the specified accuracy
provided that no roundoff errors are made and the algorithms continue
to completion.


heuristic algorithms: These algorithms, which can include statistical tech- 
niques, may work well for many problems, but there is no guarantee 
that the answers they provide are close to an actual global minimum. 


Among deterministic algorithms, the algorithms can


• attempt to find merely an approximation to the global optimum and a
single set of parameters $x$ corresponding to the global optimum;


• attempt to find not only an approximation to the global optimum, but
also approximations to all optimizing points$^9$ $x$. Such algorithms are
often called complete search algorithms.


Deterministic algorithms may also be automatically verified, in which, if
the algorithm finishes, it is guaranteed$^{10}$ to return mathematically rigorous
bounds (given by machine-representable numbers) on the global optimum and
the global optimizing points; in automatically verified complete search algorithms,
the algorithm (if it completes) is guaranteed to supply mathematically
rigorous bounds on all of the coordinates of all points $x \in \mathbb{R}^n$ with $\varphi(x) = \varphi^*$.

As mentioned earlier in this chapter, there are classes of problems with 
relatively fast deterministic algorithms. These include 


• linear programs;

• quadratic programs;

• convex optimization problems;$^{11}$

• problems with special structure from various important application areas.

There are also important problem classes with special structure, but for 
which fast deterministic algorithms are not known in general. An example of 
this is 


integer programming problems: In these problems, some or all of the
variables $x_i$ are constrained to be integers. (In principle, such constraints
can be made to fit into the general form (9.1) (on page 487) by
appending constraints of the form $c_i(x) = \sin(\pi x_i) = 0$. However, special
techniques are often employed for integer programming problems.)


$^9$ Even though the global optimum $\varphi^*$ must be unique, it may be that $\varphi(x) = \varphi^*$ at many
points $x$.

$^{10}$ Unless there is a programming error or a computer hardware malfunction.

$^{11}$ In which the objective and all of the constraint functions are convex.
Heuristic algorithms take various forms. Several important classes of heuristic
algorithms include a statistical component.

In the remainder of this section, we briefly introduce several of the more 
prominent current general techniques, both heuristic and deterministic, not 
verified and automatically verified, for global optimization. 


9.6.1 Genetic Algorithms 


Genetic algorithms are a heuristic global optimization technique with a ran- 
dom component [11, 38]. Genetic algorithms (which we abbreviate “GA’s” ) 
are modeled after genetics and evolution. GA’s are designed to find low ob- 
jective values for large, complex problems. The search proceeds in a survival- 
of-the-fittest manner. The evolution of a population of potential solutions is 
monitored until the superior parameter values dominate the population. 

The starting point in using GA’s is by representing the problem in a “biolog- 
ical” manner. This often leads to a binary representation, that is, representing 
the parameters of the problem using a string of binary digits. Perhaps the 
best way to understand GA’s is by considering a simple problem. 


Example 9.12 
Consider the problem of finding $0 \le x \le 63$ that minimizes
$$\varphi(x) = x^2 - 60x - 100. \tag{9.21}$$
The solution to this problem is $x = 30$. We will represent this problem in
binary form as finding $000000 \le x \le 111111$ so that $\varphi(x)$ is minimum.

The solution is x = 011110. This problem has a biological parallel. The 
bit string can be viewed as a chromosome-type structure. The 0’s and 1’s 
correspond to genes in the chromosome. Just as a chromosome can be decoded 
to reveal characteristics of an individual organism, a bit string can be decoded 
to reveal a potential solution. 


A genetic algorithm procedure consists of the following steps. Each step 
will be illustrated with Example 9.12. 


Step 1: An initial population of potential solutions is randomly created us- 
ing a random number generator. Suppose that our initial randomly 
generated population consists of the four strings: 


000001, 110100, 111000, 010100 in decimal: 1, 52, 56, 20. 


Step 2: Calculate the performance or fitness of each individual in the popu- 
lation. GA’s are typically implemented in such a way that performance, 
or fitness, is the value to be minimized. 
For our problem, a good measure of fitness is $\varphi(x)$. The lower $\varphi(x)$
is, the more fit string $x$ is to survive and to breed. For the initial
population, we have the table:

String x | Decimal equivalent | Fitness φ(x)
000001 | 1 | -159 (least fit)
110100 | 52 | -516
111000 | 56 | -324
010100 | 20 | -900 (most fit)


Step 3: Select individuals to be parents for the next generation. Basically, 
better performing individuals are chosen as parents. Poorly performing 
individuals do not produce offspring. One possible scheme for selecting 
parents is by throwing away the worst-performing string and replacing it 
with the best-performing string. Applying this procedure to our example 
yields the strings: 


010100, 110100, 111000, 010100. 


Step 4: Create a second generation of children from the parents. There are 
a variety of genetic operations that can be performed at this stage. 
One of the most powerful methods is crossover, and is motivated by 
an analogous biological process. In crossover, two strings are randomly 
chosen. A point is randomly chosen at which the two strings are to be 
cut. Another random number is then chosen to indicate whether or not 
crossover is to be performed. If crossover is to be performed (perhaps 
set for 60% of the time), then the tails and heads of the two strings are 
exchanged. Notice in the crossover process, several random selections 
were performed. A pair was randomly chosen, where to cut the strings 
was randomly decided, and whether or not to perform the crossover 
was randomly decided. Crossover is a powerful process that extends the 
search in many directions. 


Consider our example. Suppose crossover is decided for 010100 and 
111000. They are cut at 0—10100 and 1—11000. Crossover is performed 
to yield the new population: 011000, 110100, 110100, 010100. 


Mutation is a second possible genetic operation that can be performed. 
(This is also inspired by a biological operation.) In mutation, a partic- 
ular bit in a particular string is randomly selected and changed from 0 
to 1 or from 1 to 0. Generally, mutation is performed infrequently. 


Crossover and mutation are common genetic operations used in genetic 
algorithms. Crossover tends to pass on attractive patterns from one 
generation to the next. Mutation gently nudges the search in slightly 
different directions. 
Step 5: Steps 2 through 4 are repeated until the population converges. The
parent selection scheme ensures that good strings drive out bad strings. 
Crossover spreads pieces of well-performing strings to poor-performing 
strings. As these steps are repeated over and over, the population grad- 
ually becomes more homogeneous. When all strings are identical, the 
population is said to be fully converged. 


Consider now two more generations of our example. 


Step 2: (Second generation)

String x | Decimal equivalent | Fitness φ(x)
011000 | 24 | -964
110100 | 52 | -516
110100 | 52 | -516
010100 | 20 | -900


Step 3: 110100 is replaced by 011000 to yield the population: 


011000, 011000, 110100, 010100. 


Step 4: Crossover is performed with strings 011000 and 010100. The strings 
are cut at 011 — 000 and 010 — 100. The new population is 011000, 
010000, 110100, 011100. A mutation is made to the first string in the 
fifth position. The new population is 011010, 010000, 110100, 011100. 


Step 2: (Third generation)

String x | Decimal equivalent | Fitness φ(x)
011010 | 26 | -984
010000 | 16 | -804
110100 | 52 | -516
011100 | 28 | -996


Step 3: 110100 is replaced by 011100 to yield the population: 


011010, 010000, 011100, 011100. 


Step 4: Crossover is performed with strings 011010 and 011100. The strings 
are cut at 0110 — 10 and 0111 — 00. The new population is 


011110, 010000, 011100, 011000. 


Step 5: (Fourth generation)

String x | Decimal equivalent | Fitness φ(x)
011110 | 30 | -1000
010000 | 16 | -804
011100 | 28 | -996
011000 | 24 | -964


Notice that the fourth generation is much more fit than the first generation. 
Much more complicated problems can be approximately solved using a genetic 
algorithm procedure. 
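The genetic operations used in Example 9.12 are simple string manipulations. The sketch below is an illustration only: the function names are hypothetical, and the random choices of a real genetic algorithm (which pair to cross, where to cut, whether to mutate) are replaced by the specific choices made in the worked example, so that each generation can be reproduced exactly.

```python
def phi(x):
    """Objective (9.21); lower values are more fit."""
    return x * x - 60 * x - 100


def decode(bits):
    """Interpret a bit string as a nonnegative integer."""
    return int(bits, 2)


def select(population):
    """Step 3 of the example: overwrite the worst string with the best one."""
    fitness = [phi(decode(s)) for s in population]
    worst = fitness.index(max(fitness))
    best = population[fitness.index(min(fitness))]
    return population[:worst] + [best] + population[worst + 1:]


def crossover(a, b, cut):
    """Exchange the tails of two strings after the first `cut` bits."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(bits, position):
    """Flip the bit at the given 1-based position."""
    i = position - 1
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]


first = ["000001", "110100", "111000", "010100"]
parents = select(first)                          # worst string replaced by best
child_a, child_b = crossover("010100", "111000", 1)
mutated = mutate("011000", 5)
final_pair = crossover("011010", "011100", 4)
print(parents, child_a, child_b, mutated, final_pair)
print([phi(decode(s)) for s in ["011110", "010000", "011100", "011000"]])
```

In a full implementation the arguments to `crossover` and `mutate` would be drawn from a random number generator, and the loop over generations would stop once the population has converged.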


9.6.2 Simulated Annealing 


Simulated annealing is another heuristic technique with a random compo- 
nent. However, as the name implies, simulated annealing is modeled after the 
annealing process in metallurgy. In addition to web resources for simulated 
annealing, the monograph [95] provides an introduction. 


9.6.3. Branch and Bound Algorithms 


In branch and bound algorithms, the entire domain is systematically searched, 
eliminating portions of the domain that cannot contain optimizers. 


9.6.3.1 Branch and Bound Procedures in General 


Branch and bound processes are a general class of deterministic algorithms 
that proceed by the following steps. 


1. An upper bound $\overline{\varphi}$ on the global optimum over the feasible set (defined
by the constraints) is established.


2. The initial region $D$ is subdivided into two or more subregions $\tilde{D}$. (This
is the branching step, where we branch into subregions.)


3. The range of the objective function is bounded below over each subregion
$\tilde{D}$, to obtain
$$\underline{\varphi}(\tilde{D}) \le \{ \varphi(x) \mid x \in \tilde{D},\ c(x) = 0,\ g(x) \le 0 \}.$$
(This is the bounding step, where we bound the range of $\varphi$.)

4. IF $\underline{\varphi} > \overline{\varphi}$ THEN
      $\tilde{D}$ is discarded,
   ELSE IF the diameter of $\tilde{D}$ is smaller than a specified tolerance THEN
      $\tilde{D}$ is put onto a list of boxes containing possible global optimizers.
   ELSE
      $\tilde{D}$ is put onto a list for further branching and bounding through
      steps 2 and 3.
   END IF


The upper bound $\overline{\varphi}$ on the global optimum is sharpened, often with consideration
of each new region $\tilde{D}$, for example, by evaluating $\varphi$ at a point $x \in \tilde{D}$
that is known to be feasible.

Some of the ways the bounding process can be done are 


e by using Lipschitz constants to bound y, c, and g, 
e by interval arithmetic, or 


e more generally, by relaxations. 


DEFINITION 9.4 A relaxation of the global optimization problem (9.1)
is a related problem, where $\varphi$ is replaced by some related function $\tilde{\varphi}$, each set
of constraints is replaced by a related set of constraints, such that, if $\varphi^*$ is the
global optimum to the original problem (9.1) and $\tilde{\varphi}^*$ is the global optimum to
the relaxed problem, then $\tilde{\varphi}^* \le \varphi^*$.


REMARK 9.17 In particular:
• If one deletes one or more constraints, one obtains a relaxation.
• If one:

1. replaces $\varphi$ over $D$ by $\tilde{\varphi}$ such that
$\tilde{\varphi}(x) \le \varphi(x)$ $\forall x \in D$, and/or

2. replaces one or more $c_i(x) = 0$ by a pair $\{c_i(x) \le 0,\ -c_i(x) \le 0\}$,
then replaces $c_i(x) \le 0$ by $\tilde{c}_i(x) \le 0$, where
$\tilde{c}_i(x) \le c_i(x)$ $\forall x \in D$, and/or replaces $-c_i(x) \le 0$ by
$-\tilde{c}_i(x) \le 0$, where $-\tilde{c}_i(x) \le -c_i(x)$ $\forall x \in D$,
and/or

3. replaces one or more $g_i(x) \le 0$ by $\tilde{g}_i(x) \le 0$, where
$\tilde{g}_i(x) \le g_i(x)$ $\forall x \in D$,

then the resulting problem is a relaxation to the original problem.

□


An important type of problem to which relaxations typically are applied is
integer programming problems. A natural relaxation for such problems is
to ignore the constraint that a variable $x_i$ be integer, and treat $x_i$ as a real
variable.

Example 9.13
Let the global optimization problem be defined by

$\varphi(x_1, x_2) = x_1^2 - 2x_1x_2 + 2x_2^2 - 4x_2 + 4$,

subject to $1 \le x_1 \le 11$, $1 \le x_2 \le 11$, $x_1$ and $x_2$ integers.

This integer programming problem has global optimum $\varphi^* = 0$ at the unique
global optimizer $x_1 = x_2 = 2$.


We may apply a relaxation to Example 9.13 by assuming the variables
are real. To obtain an initial $\overline{\varphi}$, we use a local optimizer to
obtain a point $\hat{x}$, possibly adjusting $\hat{x}$ so that it has integer
coordinates, then taking $\overline{\varphi} = \varphi(\hat{x})$. For illustration
purposes, suppose we have obtained $\hat{x} = (1,1)$, so
$\overline{\varphi} = \varphi(1,1) = 1$. Suppose also that we start with the
initial region

$D = \boldsymbol{x} = ([1,11], [1,11])^T$,

and we subdivide ("branch") on $\boldsymbol{x}$ by bisecting the widest coordinate
of $\boldsymbol{x}$. The branch and bound algorithm would then proceed as follows.


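[Search tree: the root $\boldsymbol{x}$ branches into $\boldsymbol{x}^{(1)}$ and $\boldsymbol{x}^{(2)}$ (fathomed); $\boldsymbol{x}^{(1)}$ branches into $\boldsymbol{x}^{(11)}$ and $\boldsymbol{x}^{(12)}$.]

FIGURE 9.7: The search tree for Example 9.13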


Step 2: Cut $\boldsymbol{x}$ into
$\boldsymbol{x}^{(1)} = ([1,11],[1,6])^T$ and $\boldsymbol{x}^{(2)} = ([1,11],[6,11])^T$.

Step 3: Here, for simplicity, we will use interval arithmetic to bound $\varphi$,
although there are many alternative techniques, such as the ones we explain
in Section 9.6.3.2 below. For illustration purposes,¹² we rewrite $\varphi$ here as

$\varphi(x) = (x_1 - x_2)^2 + (x_2 - 2)^2$.

(In this form, there is less overestimation of the range in the interval
evaluations.) Using interval arithmetic on $\boldsymbol{x}^{(1)}$, we obtain
$\varphi(\boldsymbol{x}^{(1)}) \subseteq [0,116]$.


12 to avoid obscuring the process by having to do too many steps


Step 4: Since $\overline{\varphi} \in [0,116]$, we cannot delete $\boldsymbol{x}^{(1)}$;
therefore, we place $\boldsymbol{x}^{(1)}$ on a list $\mathcal{L}$ to be processed
further in Step 2.

Step 3: Now working on $\boldsymbol{x}^{(2)}$, we obtain
$\varphi(\boldsymbol{x}^{(2)}) \subseteq [16,181]$.

Step 4: Since $\overline{\varphi} \notin [16,181]$ (indeed, $\overline{\varphi} = 1 < 16$),
we may remove $\boldsymbol{x}^{(2)}$ from further processing. We say that
$\boldsymbol{x}^{(2)}$ has been fathomed.

Step 2: Now, $\boldsymbol{x}^{(1)}$ is at the top of the list for further
processing.¹³ In that case, we form

$\boldsymbol{x}^{(11)} = ([1,6],[1,6])^T$ and $\boldsymbol{x}^{(12)} = ([6,11],[1,6])^T$.

Step 3: We have $\varphi(\boldsymbol{x}^{(11)}) \subseteq [0,41]$.

Step 4: Since $\overline{\varphi} \in [0,41]$, we must put $\boldsymbol{x}^{(11)}$ onto
$\mathcal{L}$ for further processing.

Step 3: We have $\varphi(\boldsymbol{x}^{(12)}) \subseteq [0,116]$. Since
$\overline{\varphi} \in [0,116]$, we must also store $\boldsymbol{x}^{(12)}$ in the list.
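The range bounds quoted in these steps can be reproduced with a few lines of naive interval arithmetic. The sketch below is illustrative only: the helper names are made up here, intervals are plain `(lo, hi)` tuples, and no outward rounding is done (the endpoints in this example are small integers, so the arithmetic is exact).

```python
def isub(a, b):
    """Interval difference a - b."""
    return (a[0] - b[1], a[1] - b[0])

def iadd(a, b):
    """Interval sum a + b."""
    return (a[0] + b[0], a[1] + b[1])

def isqr(a):
    """Range of t^2 for t in the interval a."""
    lo, hi = a
    if lo >= 0:
        return (lo * lo, hi * hi)
    if hi <= 0:
        return (hi * hi, lo * lo)
    return (0, max(lo * lo, hi * hi))

def phi_range(box):
    """Interval evaluation of (x1 - x2)^2 + (x2 - 2)^2 over box = (x1, x2)."""
    x1, x2 = box
    return iadd(isqr(isub(x1, x2)), isqr(isub(x2, (2, 2))))

phi_upper = 1  # upper bound on the optimum, from evaluating the objective at (1, 1)
boxes = {
    "x(1)": ((1, 11), (1, 6)),
    "x(2)": ((1, 11), (6, 11)),
    "x(11)": ((1, 6), (1, 6)),
    "x(12)": ((6, 11), (1, 6)),
}
for name, box in boxes.items():
    lo, hi = phi_range(box)
    status = "fathomed" if lo > phi_upper else "kept for further branching"
    print(name, (lo, hi), status)
```

Only the second box has a lower bound above the upper bound on the optimum, so it is the only one that can be discarded at this stage.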


This process can be depicted by a search tree, as is illustrated in Figure 9.7. 
Branch and bound processes differ widely in their ability to solve problems 
efficiently. Items affecting efficiency include 


• the techniques used to get upper bounds $\overline{\varphi}$ on the global optimum;

• the techniques used to obtain lower bounds on the objective over subregions
of the domain $D$;

• the way that the list $\mathcal{L}$ of boxes that haven't yet been fathomed is
ordered;

• acceleration procedures, used to eliminate a subregion or reduce its size
prior to subdivision;

• the way that a region is subdivided (into how many subregions, where
the cuts are made, etc.).


13 There are various methods for ordering the list $\mathcal{L}$.


9.6.3.2 Some Special Relaxations


A common type of relaxation is a linear relaxation, in which the objective 
and constraints are replaced by linear functions. A big advantage of this 
is that well-developed linear programming technology, more tractable and 
better understood than general nonlinear optimization, can be used to find 
solutions to the relaxations. Thus, this kind of relaxation is commonly used 
in software that does not employ interval evaluations. Various techniques can 
be used to obtain such relaxations; some of these are general and can be done 
automatically by the machine. Such automatic creation of relaxations can
proceed by decomposition into elementary operations, as in automatic
differentiation (see page 329), or by more sophisticated techniques. A
decomposition technique of this kind is illustrated in [47].

Convex relaxations are also sometimes used.


9.6.3.3. Acceleration Procedures 


The basic process in branch and bound algorithms is to bound the optimum 
above, bound the ranges of the objective and constraints over subregions, then 
reject those subregions over which the range bounds show that the problem 
either must be infeasible or the objective function must be greater than the 
upper bound on the optimum. Although practical for some problems, this 
simple basic technique requires much computation, especially in higher
dimensions, where, with $n$ variables, bisecting uniformly so the resulting box
has half the diameter of the original box results in $2^n$ sub-boxes. For this
reason, various other techniques are used in practical software systems to reduce
the volume of region to be searched. We now mention a few of these. 


Use of Local Optimization Software. Local optimization software generally
completes its computations much more rapidly than a branch and bound
procedure. If such software provides a point $\hat{x}$ such that $\hat{x}$ is
feasible, then $\overline{\varphi}$ may be replaced by the minimum of its previous
value and $\varphi(\hat{x})$. It is likely that such $\hat{x}$ will have lower values
of $\varphi$ than $\varphi(x)$, where $x$ is randomly chosen. Furthermore, for
constrained problems, it is only valid to use values $\varphi(x)$ if $x$ is
feasible, and local optimization software will usually converge to such
feasible points, and often converges to a point near a global optimizing point.
Lower values for $\overline{\varphi}$, especially when found early in the branch and
bound process, lead to earlier rejection of larger sub-regions, reducing the
need for further branching.


Constraint Propagation. A simplistic view of constraint propagation is
in terms of solving the nonlinear relation representing $c_i = 0$ or $g_i \le 0$, or
a component of the Kuhn-Tucker conditions, for one variable, then computing
new bounds on that variable based on the present bounds on the other
variables, similar to the nonlinear Jacobi method we explained on page 461.


Example 9.14 


Consider

minimize   $\varphi(x) = x_1^2 - x_2^2$
subject to $x_1^2 + x_2^2 = 1$,


           $x_1 + x_2 \le 0$.


Suppose the goal is to find all optimizing points, and suppose we have used
a local optimizer to find the feasible point $\hat{x} = (x_1, x_2) = (0,-1)$, with
$\varphi(\hat{x}) = -1 = \overline{\varphi}$, and suppose we are searching in the
initial box $(x_1, x_2) \in ([-1,1], [-1,1])$. Then, in addition to the constraints,
we have the condition

$x_1^2 - x_2^2 \le -1 \iff \{x_1 \le \sqrt{x_2^2 - 1} \text{ and } x_1 \ge -\sqrt{x_2^2 - 1}\}$
$\iff \{x_2 \ge \sqrt{x_1^2 + 1} \text{ or } x_2 \le -\sqrt{x_1^2 + 1}\}$.

That is, we can use the upper bound $\overline{\varphi}$ on the global optimum to
either reduce the range of $x_1$, given a range on $x_2$, or to reduce the range
of $x_2$, given a range on $x_1$. Taking $x_2 \in [-1,1]$ and substituting into the
inequalities for $x_1$, we obtain

$x_1 \le \sqrt{[-1,1]^2 - 1} = \sqrt{[0,1] - 1} = \sqrt{[-1,0]}$.

Although there are values $x \in [-1,0]$ for which $\sqrt{x}$ is not defined as a
real number, in this context, it is appropriate to interpret $\sqrt{[-1,0]}$ to
mean a bound on the range of $\sqrt{\cdot}$ over those values of $x \in [-1,0]$ at
which $\sqrt{x}$ is defined.¹⁴ Interpreted this way, $\sqrt{[-1,0]} = [0,0]$, so

$x_1 \le 0$.

Taking the other condition $x_1 \ge -\sqrt{x_2^2 - 1}$ similarly gives $x_1 \ge 0$,
and combining gives $x_1 = 0$. Now, solving for $x_2$ in the equality constraint
$x_1^2 + x_2^2 = 1$ gives

$x_2 = \sqrt{1 - x_1^2}$ or $x_2 = -\sqrt{1 - x_1^2}$.

Since the previous computation established that $x_1 \in [0,0] = 0$, we obtain
$x_2 = 1$ or $x_2 = -1$, giving only two points $(0,-1)$ and $(0,1)$. Substituting
the second point into the inequality constraint leads to a contradiction,
thus proving that $(x_1, x_2) = (0,-1)$ is the only possible optimizing point for
the problem within the box $([-1,1],[-1,1])$, without any need for further
branching in the branch and bound process.


14 Some systems for interval arithmetic do interpret interval values of
$\sqrt{\cdot}$ in this way, while others do not. In the context of interval Newton
methods, it is important that computation not continue when part of the
argument is out of range. INTLAB interprets $\sqrt{[-1,0]}$ as a set of numbers
in the complex plane, which is appropriate in yet other contexts.
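The two propagation steps of this example (the bound on the objective, then the equality constraint) can be mimicked with a clipped interval square root. This is a rough sketch under stated assumptions: intervals are `(lo, hi)` tuples, the helper names are invented for illustration, roundoff is ignored, and the final check against the inequality constraint is left out.

```python
import math

def isqr(a):
    """Range of t^2 for t in the interval a."""
    lo, hi = a
    if lo >= 0:
        return (lo * lo, hi * hi)
    if hi <= 0:
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))

def isqrt_clipped(a):
    """Range of sqrt over the part of a where sqrt is real; None if that part is empty."""
    lo, hi = a
    if hi < 0:
        return None
    return (math.sqrt(max(lo, 0.0)), math.sqrt(hi))

def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

x1, x2 = (-1.0, 1.0), (-1.0, 1.0)    # initial box
phi_upper = -1.0                     # objective value at the feasible point (0, -1)

# Objective bound: x1^2 - x2^2 <= phi_upper  <=>  |x1| <= sqrt(x2^2 + phi_upper).
sq2 = isqr(x2)
radicand = (sq2[0] + phi_upper, sq2[1] + phi_upper)   # the interval [-1, 0]
root = isqrt_clipped(radicand)                        # clipped square root: [0, 0]
x1 = intersect(x1, (-root[1], root[1]))               # x1 is reduced to [0, 0]

# Equality constraint: x2 = +sqrt(1 - x1^2) or x2 = -sqrt(1 - x1^2).
sq1 = isqr(x1)
root2 = isqrt_clipped((1.0 - sq1[1], 1.0 - sq1[0]))   # [1, 1]
x2_candidates = [intersect(x2, root2), intersect(x2, (-root2[1], -root2[0]))]
```

After these two steps the box has collapsed to the two candidate points with first coordinate 0 and second coordinate 1 or -1, without any bisection.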


Constraint propagation has numerous variants, depending on which rela- 
tions are solved for which variables, and depending on whether or not the 
system of constraints is modified. Computer languages have been developed 
specifically for constraint propagation, also called constraint programming. 
One reference on the subject is [5]. 


Techniques for Sharper Lower Bounds on the Range. A commonly
used technique is to replace the optimization problem (9.1) by a linear
relaxation, resulting in a linear program. If the original optimization problem
is convex, then the problem can be approximated arbitrarily well by a linear
relaxation, and the lower bound for $\varphi$ over a particular region can be
computed sharply by solving a linear program.¹⁵ Although nonconvex nonlinear
programs cannot be approximated arbitrarily closely by linear programs,
relaxations can still be obtained, and these relaxations can give better lower
bounds on the range of $\varphi$ than, say, a naive evaluation using interval
arithmetic. One way of viewing the reason for this is that by simply evaluating
$\varphi$, we do not take into account that a portion of the range is over points
that are infeasible, whereas the solution to a linear relaxation does include
the constraints.

Linear relaxations can be used in various ways other than for determining
lower bounds on the range of $\varphi$. Some of these ways are described in [93],
although additional research has been done on the subject since then.


Interval Newton Methods. Proposition 8.2 (page 464) states that any
solutions of $F(x) = 0$ that are in $\boldsymbol{x}$ must also be in the image of
$\boldsymbol{x}$ under an interval Newton method, while Theorem 8.9 (page 465)
states that, if a Lipschitz matrix is used in the interval Newton method and
the interval Newton method maps $\boldsymbol{x}$ into the interior of
$\boldsymbol{x}$, then there is a unique solution to $F(x) = 0$ within
$\boldsymbol{x}$. These facts can be used in a branch and bound algorithm:
Instead of subdividing a box $\boldsymbol{x}$, an interval Newton method can
reduce the volume of $\boldsymbol{x}$ through an iterative process, with a
guarantee that no critical points of the optimization problem are lost.¹⁶ This
is effective for some problems, but primarily when the Jacobian matrix for $F$
is well-conditioned. In fact, however, the Jacobian matrix for the Kuhn-Tucker
conditions (9.17) typically contains singular matrices unless fairly sharp
bounds on the Lagrange multipliers $u$ and $v$ are known. Generally, interval
Newton methods are most practical in finding sharp bounds in which it is proven
that a solution exists, provided a good approximate solution is already known,
while interval Newton methods are not as effective in narrowing wide bounds.


15 which may be a large, sparse linear program
16 Here, $F(x) = \nabla\varphi$ if the problem is unconstrained, and $F(x)$ is
defined to be the Kuhn-Tucker function (9.17) or the Fritz John function for
constrained problems.


Explicit Constructions. A phenomenon called the clustering effect
typically occurs in the basic branch and bound algorithm. Because
$\overline{\varphi}$ can be higher (however slightly) than the actual global
minimum, and because the lower bound $\underline{\varphi}(\boldsymbol{x})$ on the
range of $\varphi$ over a box $\boldsymbol{x}$ that is adjacent to boxes containing
the global minimum is less than the exact lower bound on the range, and since
the exact lower bound on the range of $\varphi$ is near the global optimum
(because of the continuity of $\varphi$ and the fact that $\boldsymbol{x}$ is near
an optimizing point), it can happen that
$\underline{\varphi}(\boldsymbol{x}) < \overline{\varphi}$, even though
$\boldsymbol{x}$ cannot contain any global optimizing points.¹⁷ The result is
that the algorithm produces clusters of boxes around optimizing points, which,
under certain conditions, are even larger if the stopping tolerance (giving a
box size for which the box is no longer subdivided) in the branch and bound
algorithm is made smaller. This phenomenon was first analyzed mathematically
in [26, 46], and later in [79].
The following procedure can ameliorate the clustering problem. 


1. Assume that an approximate optimizing point $\hat{x}$ has been found.


2. Construct a small box $\hat{\boldsymbol{x}}$ centered about $\hat{x}$, such that
$\hat{\boldsymbol{x}}$ has diameter an order of magnitude larger than the box
diameter tolerance for the branch and bound algorithm.


3. Eliminate $\hat{\boldsymbol{x}}$ from the search region.


A similar technique appeared in a general setting in [43], while an explicit
method for eliminating $\hat{\boldsymbol{x}}$ from the search region is explained in
[44, §4.3.1].

Interval Newton methods can also help the clustering problem, when they 
are applicable. 


9.6.3.4 Considerations for Automatically Verified Algorithms 


Interval arithmetic is commonly used in commercial software¹⁸ employing
branch and bound methods for global optimization, to economically compute 
bounds on ranges and in constraint propagation processes. However, such 
software usually does not claim to find mathematically rigorous bounds on all 
optimizing points, and special care must be taken in those cases. 


17 This is also true in constrained problems, where $\boldsymbol{x}$ can be near
the feasible set, even though it does not contain any feasible points.
18 such as BARON [93]


For example, values of $\overline{\varphi}$ can be obtained in constrained problems
by evaluating $\varphi$ at feasible points $x$. In algorithms that do not take
account of roundoff error and other computational errors (such as terminating
an iteration when $|x_{k+1} - x_k|$ is small but not equal to zero), it is
permissible to set $\overline{\varphi}$ to $\varphi(\hat{x})$, where $\hat{x}$ is
approximately feasible. Such $\overline{\varphi}$ are usually close to an upper
bound on the global optimum, but may be slightly less than the actual upper
bound. In most cases, branch and bound algorithms using such approximate
$\overline{\varphi}$ will give good results anyway. However, in some cases, they
may neglect to find some optimizing points, or may fail to approximate an
actual optimizing point well. For software to claim that it obtains rigorous
bounds on all optimizing points, it would typically need to bound the range of
$\varphi$ over a small box $\hat{\boldsymbol{x}}$ that has been proven (say, with
an interval Newton method) to contain feasible points.¹⁹ This fact can lead to
additional computation, including significantly more branching, in algorithms
that claim to rigorously bound all optimizing points.

One aspect of rigorous branch and bound methods that can cause them not 
to finish in a practical amount of time, while nonrigorous ones will, is when the 
optimizing points are not isolated. For example, if Example 9.7 is computed 
using branch and bound software that does not have a facility to look for affine 
sets (or, more generally, manifolds) of optimizing points, then the entire set 
of optimizers must be covered by small boxes, and the branch and bound 
algorithm will produce, through branching, numerous small boxes near the 
set that covers the optimizing points. This can be impractical, especially if 
the number of variables is large. 


19 This technique is presented in [45] and [44, §5.2.4].


9.7 Exercises


1. Fill in the details of the solution process for the golden mean
   $\alpha = (\sqrt{5} - 1)/2$
   in Remark 9.2 on page 491.

2. Fill in the details of the proof of Lemma 9.1 (on page 493).

3. Show that $B_{k+1}$ defined in (9.12) on page 496 satisfies the quasi-Newton
   equation. (That is, $B_{k+1}s = y$.)

4. On graph paper, draw each triangle produced in the computations for
   the simplex method of Nelder and Mead (Algorithm 9.1 on page 497).
   Refer to the computations in Example 9.3 on page 500. Do your plots
   give you insight into how the Nelder-Mead method works?

5. Apply an iteration or two of the steepest descent method to the objective
   function in Example 9.3. To determine the $\lambda_k$, use the quadratic
   interpolation procedure described on page 494.

6. Repeat Problem 5, but alter the objective function to
   $\varphi(x) = (x/100)^2 + y^2 - 4$. What do you observe? (Note that the new
   problem is essentially a scaling of the old problem, and has the same
   solution.)

7. Consider finding the minimum of $f(x,y) = x^2 - 4x + xy - y + 6$ on the
   square $D = \{(x,y) \in \mathbb{R}^2 : 0 \le x, y \le 3\}$. Apply one iteration of
   the method of steepest descent with $x^{(0)} = 2$, $y^{(0)} = 1$, to find the point
   $(x^{(1)}, y^{(1)}) \in D$ that minimizes $f(x,y)$ in the descent direction.

8. Consider the following descent method for finding the minimum of a
   function $f : \mathbb{R}^n \to \mathbb{R}^1$ where $f \in C^1(\mathbb{R}^n)$. For
   $k = 0, 1, 2, \ldots$

   (a) $\varphi_k(t) = f(x_k - t\nabla f(x_k))$.
   (b) $t_k \in \mathbb{R}$ is chosen so that $\varphi_k(t_k) = \min_{t \in \mathbb{R}} \varphi_k(t)$.
   (c) $x_{k+1} = x_k - t_k\nabla f(x_k)$.

   Prove that $f(x_{k+1}) \le f(x_k)$, and that $f(x_{k+1}) = f(x_k)$ if and only if
   $\nabla f(x_k) = 0$.

In Exercises 9 to 11, consider the minimax problem

$\min_{x \in \boldsymbol{x}} \left\{ \max_{1 \le i \le m} |f_i(x)| \right\}$   (9.22)

where $\boldsymbol{x}$ is an interval vector. (That is, we are minimizing the
objective subject to bound constraints.) This is an example of a nonsmooth
optimization problem, in which the gradient of the objective function does
not exist everywhere. Such problems are common in various applications.
Nonsmooth problems can be transformed to smooth constrained problems
using Lemaréchal's technique: We introduce a new variable $v$, and transform
(9.22) to

min $v$
subject to $v \ge f_i(x)$, $1 \le i \le m$,   (9.23)
           $v \ge -f_i(x)$, $1 \le i \le m$,
           $x \in \boldsymbol{x}$.


9. Suppose we wish to find the best fit of the form $g(t) = x_1 t + x_2$ in the
$\ell_\infty$ norm to the data
1 


t;| 0 2 


for $-10 \le x_1 \le 10$, $-10 \le x_2 \le 10$. Thus, $f_i(x) = x_1 t_i + x_2 - y_i$.

(a) Write down the transformed smooth problem using Lemaréchal's
technique.


(b) Put the resulting linear program into standard form. (Note: You 
may put it into the standard form as in Definition 9.2 on page 504 
or else into the form used for input to a linear programming solver, 
such as linprog from MATLAB’s optimization toolbox.) 


(c) Solve the resulting linear program. (You may either do this by 
hand, or use your favorite linear programming solver. In either 
case, however, explain clearly what you have done.) 


10. Prove the equivalence of Problem (9.23) to Problem (9.22). That is,
prove

(a) if $x^*$ solves (9.22), then $x^*$ solves (9.23), and

(b) if $x^*$ solves (9.23), then $x^*$ solves (9.22).


11. In addition to minimax problems, Lemaréchal's technique can also be
used to transform $\ell_1$ problems. In particular, if the problem is

$\min \sum_{i=1}^{m} |f_i(x)|$   (9.24)

then the problem can be transformed by introducing variables $v_i$,
$1 \le i \le m$.

(a) Use Lemaréchal's technique to reformulate (9.24) as a smooth
constrained optimization problem.

(b) Redo Problem 9, but solving (9.24) instead of (9.22).

(c) How does the $\ell_1$ solution compare to the $\ell_\infty$ solution?


12. Consider Example 9.7 on page 512.

(a) Convert the problem to standard form. Hint: You should consider
the upper bounds, e.g. $100A \le 100000$, as regular inequality
constraints.

(b) In standard form, what is $m_1$?


(c) Write down the values of all of the variables in standard form (in- 
cluding the ones that are equal to zero) at the end points t = 0 and 
t = 1250/0.9285 of the parametrized set of solutions of optimizing 
points. How many of these variables are nonzero? 


(d) How many variables are nonzero at the interior points of the
parametrized solution set?

(e) How does this relate to the Fundamental Theorem of Linear
Programming?


13. Show that the linear program in Example 9.9 on page 514 is infeasible by
drawing the solution set to each of the inequalities defining the feasible
region.


14. Use the simplex method to solve the linear program:

maximize:   $2x_1 + x_2 = z$
subject to: $x_1 - x_2 \le 4$,
            $x_1 + x_2 \le 15$,
            $-x_1 + x_2 \le 6$,
            $x_1 \le 8$,
            $x_1 \ge 0$,
            $x_2 \ge 0$.

(Notice that you need to introduce 4 slack variables.)


15. Convert the objective in Problem 14 to a minimization problem by
defining $\varphi(x) = -2x_1 - x_2$. Next, write down the Kuhn-Tucker conditions
corresponding to this problem.


16. Set up an algebraic recursion for Example 9.10 (on page 515) analogous
to the recursion explained in Example 9.11. Solve your recursion to
verify the results given with Example 9.10.


Chapter 10 


Boundary-Value Problems and
Integral Equations


In this final chapter, we discuss the numerical approximation of
boundary-value problems and integral equations. These arise in many
real-world applications, for example in population dynamics and
computational mechanics.


10.1. Boundary-Value Problems

In this section, we consider the linear boundary-value problem (BVP)

$y''(x) + p(x)y'(x) + q(x)y(x) = f(x), \quad 0 < x < 1,$
$y(0) = \alpha, \quad y(1) = \beta.$   (10.1)

Equation (10.1) is the one-dimensional Dirichlet problem¹ for the stationary
diffusion equation. We study here three classes of numerical methods for
solution of (10.1):


(a) the shooting method, 
(b) finite-difference methods, and 
(c) Galerkin methods. 


Several references for the material presented in this section are [9], [83], [39], 
[56], and [83]. 
Consider first an existence-uniqueness result for (10.1). 


THEOREM 10.1
Suppose that Equation (10.1) can be put into the form

$-(r(x)y')' + s(x)y = g(x), \quad 0 < x < 1,$
$y(0) = \alpha, \quad y(1) = \beta,$   (10.2)

where $r(x) \ge c_1 > 0$, $s(x) \ge 0$, for $0 \le x \le 1$ and $s, g \in C^m[0,1]$,
$r \in C^{m+1}[0,1]$. Then there is a unique solution $y \in C^{m+2}[0,1]$.


PROOF See [56, pp. 92-93].


1 A Dirichlet problem is a problem in which values of the function (as opposed
to values of derivatives) are specified on the boundary of the region for the
differential equation.


REMARK 10.1 If $q(x) \le 0$ on $[0,1]$ and $p, q, f \in C[0,1]$, then the
BVP (10.1) can be put into the form (10.2) with $r(x) \ge c_1 > 0$, $s(x) \ge 0$,
$s, g \in C^0[0,1]$, $r \in C^1[0,1]$, and hence $y \in C^2[0,1]$. In particular, set

$\frac{r'}{r} = p$, so $r(x) = \exp\left(\int_0^x p(t)\,dt\right)$, $s = -qr$, and $g = -fr$.


REMARK 10.2 The more general problem

$z''(t) + p(t)z'(t) + q(t)z(t) = f(t), \quad a < t < b,$   (10.3)
$z(a) = \alpha, \quad z(b) = \beta,$

can be converted to form (10.1) by setting $y(x) = z(a + (b-a)x)$, $0 \le x \le 1$,
i.e., by setting $t = a + (b-a)x$.
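For reference, the change of variable in Remark 10.2 can be carried out explicitly with the chain rule. The tilde-marked coefficient names below are introduced only for this illustration and do not appear in the text.

```latex
% With t = a + (b-a)x and y(x) = z(t), the chain rule gives
%   y'(x) = (b-a) z'(t),   y''(x) = (b-a)^2 z''(t).
% Multiplying (10.3) by (b-a)^2 and substituting yields a problem of the form (10.1):
\[
  y''(x) + \tilde{p}(x)\,y'(x) + \tilde{q}(x)\,y(x) = \tilde{f}(x), \qquad 0 < x < 1,
  \qquad y(0) = \alpha, \quad y(1) = \beta,
\]
\[
  \tilde{p}(x) = (b-a)\,p\bigl(a + (b-a)x\bigr), \quad
  \tilde{q}(x) = (b-a)^2\,q\bigl(a + (b-a)x\bigr), \quad
  \tilde{f}(x) = (b-a)^2\,f\bigl(a + (b-a)x\bigr).
\]
```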


10.1.1 Shooting Method for Numerical Solution of (10.1) 


In the shooting method, we solve a boundary value problem by iteratively
solving associated initial value problems. Consider

(a) $y''(x) + p(x)y'(x) + q(x)y(x) = f(x),$
(b) $y(0) = \alpha, \quad y(1) = \beta,$   (10.4)

and the associated initial-value problem

(a) $y''(x) + p(x)y'(x) + q(x)y(x) = f(x),$
(b) $y(0) = \alpha, \quad y'(0) = \gamma,$   (10.5)

for some unknown $\gamma$. The theory of solutions to the IVP (10.5) is well-known.
For example, if $p$, $q$, and $f$ are continuous on $[0,1]$, then existence of a
unique solution of (10.5) is assured. Denote the solution of (10.5) by
$Y(x,\gamma)$, and recall that every solution of (10.5a) is a linear combination
of two particular solutions $Y^{(1)}(x)$ and $Y^{(2)}(x)$ which satisfy, say

(a) $Y^{(1)}(0) = \alpha, \quad Y^{(1)\prime}(0) = 0,$
(b) $Y^{(2)}(0) = \alpha, \quad Y^{(2)\prime}(0) = 1.$   (10.6)

Then the unique solution of (10.5a) which satisfies (10.5b) is

$Y(x,\gamma) = (1-\gamma)Y^{(1)}(x) + \gamma Y^{(2)}(x).$   (10.7)

Now, if we take $\gamma$ such that

$Y(1,\gamma) = (1-\gamma)Y^{(1)}(1) + \gamma Y^{(2)}(1) = \beta,$   (10.8)

then $y(x) = Y(x,\gamma)$ is a solution of BVP (10.4).


REMARK 10.3 There is at most one root of Equation (10.8), that is,

$\gamma = \frac{\beta - Y^{(1)}(1)}{Y^{(2)}(1) - Y^{(1)}(1)},$

provided $Y^{(2)}(1) - Y^{(1)}(1) \ne 0$. If $Y^{(2)}(1) - Y^{(1)}(1) = 0$, there may be
no solution to BVP (10.4). A solution exists in this case provided
$\beta = Y^{(1)}(1) = Y^{(2)}(1)$, but is not unique since, by (10.7), $Y(x,\gamma)$
is a solution for arbitrary $\gamma$.


Example 10.1

Consider three boundary-value problems. In the first problem, there exists a
unique solution; in the next problem, there is no solution; in the final problem,
there are an infinite number of solutions.

$y'' + (\pi/2)^2 y = 0, \quad 0 < x < 1,$
$y(0) = 1, \quad y(1) = 1.$

For this problem,

$Y^{(1)}(x) = \cos\frac{\pi x}{2},$
$Y^{(2)}(x) = \cos\frac{\pi x}{2} + \frac{2}{\pi}\sin\frac{\pi x}{2}.$

The unique solution is $y(x) = \cos\frac{\pi x}{2} + \sin\frac{\pi x}{2}$. Notice that
$Y^{(2)}(1) - Y^{(1)}(1) \ne 0$ and $\gamma = \pi/2$.

$y'' + \pi^2 y = 0, \quad 0 < x < 1,$
$y(0) = 1, \quad y(1) = 0.$

For this problem,

$Y^{(1)}(x) = \cos\pi x,$
$Y^{(2)}(x) = \cos\pi x + \frac{1}{\pi}\sin\pi x.$

Notice that $\beta \ne Y^{(1)}(1)$ and $Y^{(2)}(1) - Y^{(1)}(1) = 0$. There are no
constants $A$ and $B$ such that $y(x) = A\cos\pi x + B\sin\pi x$ satisfies
$y(0) = 1$, $y(1) = 0$.

$y'' + \pi^2 y = 0, \quad 0 < x < 1,$
$y(0) = 1, \quad y(1) = -1.$

Here,

$Y^{(1)}(x) = \cos\pi x, \quad Y^{(2)}(x) = \cos\pi x + \frac{1}{\pi}\sin\pi x.$

Notice that $Y^{(2)}(1) - Y^{(1)}(1) = 0$, but $\beta = -1 = Y^{(1)}(1)$. The solutions
are $y(x) = \cos\pi x + B\sin\pi x$ for any number $B$.

□


We assume in the following that there is a unique solution to the BVP (10.4).
The procedure described by (10.5)-(10.8) is therefore valid since $Y^{(2)}(1)$
cannot equal $Y^{(1)}(1)$.

The previous discussion motivates the shooting method for numerical
solution of BVP (10.4). If we can find $\gamma$ such that the solution of IVP (10.5)
when evaluated at $x = 1$ is equal to $\beta$, then we have solved the BVP (10.4).
This leads to the following procedure for the shooting method.


(a) Replace (10.5) by the system:

$w_1'(x) = w_2(x),$
$w_2'(x) = f(x) - p(x)w_2(x) - q(x)w_1(x), \quad 0 < x < 1,$   (10.9)
$w_1(0) = \alpha, \quad w_2(0) = z,$

with $w_1(x) = y(x)$ and $w_2(x) = y'(x)$.


(b) Solve (10.9) numerically, using, for example, a Runge-Kutta, multistep,
or extrapolation method, or a software package written by experts.²
Find an approximation to $w_1(1)$, say $w_1(1; z)$.


(c) If $|w_1(1; z) - \beta| < \epsilon$ then stop. Otherwise, modify $z$ and return to
Step (b).


This procedure corresponds to numerically solving the nonlinear problem
$F(z) = w_1(1; z) - \beta$ with solution $F(\gamma) = 0$. Therefore, a reasonable way
to update $z$ is using the secant method, i.e.,

$z^{(i+1)} = z^{(i)} - F(z^{(i)})\,(z^{(i)} - z^{(i-1)}) / (F(z^{(i)}) - F(z^{(i-1)}))$

for $i = 1, 2, \ldots$, where two starting values $z^{(0)}$ and $z^{(1)}$ are required.³
Geometrically, the procedure looks as though we are "shooting" and adjusting
the angle of the gun; see Figure 10.1. There, we are "shooting" over to hit
the point $(1,0)$. We adjust $z$ until $w_1(1) = 0$.
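Steps (a) to (c), together with the secant update, can be sketched as follows. The classical fourth-order Runge-Kutta integrator, the function names, the step count, and the tolerances are all choices made for this illustration; the test problem is the first problem of Example 10.1, for which the exact initial slope is $\pi/2$.

```python
import math

def rk4_final(rhs, w0, n):
    """Integrate w' = rhs(x, w) on [0, 1] with n classical RK4 steps; return w(1)."""
    h = 1.0 / n
    x, w = 0.0, list(w0)
    for _ in range(n):
        k1 = rhs(x, w)
        k2 = rhs(x + h / 2, [wi + h / 2 * ki for wi, ki in zip(w, k1)])
        k3 = rhs(x + h / 2, [wi + h / 2 * ki for wi, ki in zip(w, k2)])
        k4 = rhs(x + h, [wi + h * ki for wi, ki in zip(w, k3)])
        w = [wi + h / 6 * (a + 2 * b + 2 * c + d)
             for wi, a, b, c, d in zip(w, k1, k2, k3, k4)]
        x += h
    return w

def shoot(p, q, f, alpha, beta, z0, z1, eps=1e-10, n=200, max_iter=50):
    """Secant iteration on F(z) = w1(1; z) - beta for y'' + p y' + q y = f."""
    def rhs(x, w):                       # first-order system (10.9)
        return [w[1], f(x) - p(x) * w[1] - q(x) * w[0]]

    def F(z):                            # miss distance at x = 1 for initial slope z
        return rk4_final(rhs, [alpha, z], n)[0] - beta

    F0, F1 = F(z0), F(z1)
    for _ in range(max_iter):
        if abs(F1) < eps:                # step (c): stop when the target is hit
            break
        z0, z1, F0 = z1, z1 - F1 * (z1 - z0) / (F1 - F0), F1
        F1 = F(z1)
    return z1

# y'' + (pi/2)^2 y = 0, y(0) = 1, y(1) = 1; the exact slope y'(0) is pi/2.
gamma = shoot(lambda x: 0.0, lambda x: (math.pi / 2) ** 2, lambda x: 0.0,
              alpha=1.0, beta=1.0, z0=0.0, z1=1.0)
```

Because this test problem is linear, the miss distance is an affine function of the slope and the secant method lands on the answer after a single update; for a nonlinear problem several iterations would normally be needed.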


2 such as can be found in NETLIB

3 Also, software packages for nonlinear equations can be used. Newton's method
has sometimes even been used for such problems, obtaining $F'(z)$ by
differentiating the entire solution process for (10.9) using automatic
differentiation techniques.


FIGURE 10.1: Illustration of the shooting method.


Nonlinear and Linear Systems The shooting method is an effective
computational procedure even when the BVP (10.1) (or BVP (10.4)) is replaced
by a nonlinear boundary-value problem $y''(x) = g(x, y, y')$. However, for the
linear boundary-value problem (10.4), linearity can be exploited to
dramatically reduce the number of computations in the shooting method. Recall
the IVP (10.5). Suppose we numerically solve (10.5) twice, with the two
different initial conditions given by (10.6a) and (10.6b), to obtain approximate
solutions $y_1(x)$ and $y_2(x)$ that satisfy

$y_1(0) = \alpha, \quad y_2(0) = \alpha,$
$y_1'(0) = 0, \quad y_2'(0) = 1.$

We now form a linear combination of $y_1(x)$ and $y_2(x)$

$\hat{y}(x) = \lambda y_1(x) + (1-\lambda)y_2(x)$   (10.11)

such that $\hat{y}(1) = \beta = \lambda y_1(1) + (1-\lambda)y_2(1)$. Thus,

$\lambda = (\beta - y_2(1))/(y_1(1) - y_2(1)),$

and $\hat{y}(x)$ is an approximate solution of BVP (10.4). Hence, for the linear
BVP (10.4), we solve two initial-value problems numerically, forming their
sum in an appropriate manner to find an approximation to the solution $y(x)$.
Accuracy of the procedure is determined by the accuracy of the numerical
methods used to solve the initial-value problems.
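For a linear problem, the two-solve superposition just described takes only a few lines. The following is a sketch under the same assumptions as any hand-rolled integrator: a classical RK4 scheme with a fixed number of steps, and function names chosen here for illustration. The test problem is again the first problem of Example 10.1, whose exact solution is $\cos(\pi x/2) + \sin(\pi x/2)$.

```python
import math

def rk4_path(rhs, w0, n):
    """Return the RK4 states w(x_k), x_k = k/n, for w' = rhs(x, w) on [0, 1]."""
    h = 1.0 / n
    x, w, path = 0.0, list(w0), [list(w0)]
    for _ in range(n):
        k1 = rhs(x, w)
        k2 = rhs(x + h / 2, [wi + h / 2 * ki for wi, ki in zip(w, k1)])
        k3 = rhs(x + h / 2, [wi + h / 2 * ki for wi, ki in zip(w, k2)])
        k4 = rhs(x + h, [wi + h * ki for wi, ki in zip(w, k3)])
        w = [wi + h / 6 * (a + 2 * b + 2 * c + d)
             for wi, a, b, c, d in zip(w, k1, k2, k3, k4)]
        x += h
        path.append(w)
    return path

def linear_shooting(p, q, f, alpha, beta, n=200):
    """Approximate y on the mesh x_k = k/n for y'' + p y' + q y = f, y(0)=alpha, y(1)=beta."""
    def rhs(x, w):
        return [w[1], f(x) - p(x) * w[1] - q(x) * w[0]]

    y1 = [w[0] for w in rk4_path(rhs, [alpha, 0.0], n)]   # y1(0) = alpha, y1'(0) = 0
    y2 = [w[0] for w in rk4_path(rhs, [alpha, 1.0], n)]   # y2(0) = alpha, y2'(0) = 1
    lam = (beta - y2[-1]) / (y1[-1] - y2[-1])             # enforce y_hat(1) = beta
    return [lam * u + (1.0 - lam) * v for u, v in zip(y1, y2)]

y_hat = linear_shooting(lambda x: 0.0, lambda x: (math.pi / 2) ** 2, lambda x: 0.0,
                        alpha=1.0, beta=1.0)
```

With `n = 200` the middle entry `y_hat[100]` approximates the solution at $x = 0.5$, where the exact value is $\sqrt{2}$.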


REMARK 10.4 The shooting method can suffer from instabilities for
certain problems. See [833].


10.1.2 Finite-Difference Methods

In this section, we write the BVP (10.1) in the form

$L\{y\} = y'' - p(x)y' - q(x)y = r(x), \quad a < x < b,$
$y(a) = \alpha, \quad y(b) = \beta,$   (10.12)

and impose the restriction $q(x) \ge c_0 > 0$, $a \le x \le b$. We assume in this
section that $p, q, r \in C^2[a,b]$, so there exists a unique solution
$y \in C^4[a,b]$, by Theorem 10.1. We divide $[a,b]$ into a uniform mesh, setting
$x_i = a + ih$ for $i = 0, 1, \ldots, N+1$, where $h = (b-a)/(N+1)$.


FIGURE 10.2: A finite difference mesh


Recall that

$y''(x_i) = \frac{y(x_{i+1}) - 2y(x_i) + y(x_{i-1})}{h^2} + O(h^2),$

$y'(x_i) = \frac{y(x_{i+1}) - y(x_{i-1})}{2h} + O(h^2).$

Thus, on the mesh,

$\frac{y(x_{i+1}) - 2y(x_i) + y(x_{i-1})}{h^2} - p(x_i)\frac{y(x_{i+1}) - y(x_{i-1})}{2h} - q(x_i)y(x_i) = r(x_i) + O(h^2),$
$y(x_0) = \alpha, \quad y(x_{N+1}) = \beta,$   (10.13)

for $i = 1, 2, \ldots, N$. Equation (10.13) suggests the following numerical method
for solution of (10.12). Let $u_i$, $0 \le i \le N+1$, approximate $y(x_i)$, where the
$u_i$ satisfy the difference equations in (10.13) exactly, with the $O(h^2)$ term
dropped, i.e.,

(a) $L_h\{u_i\} = \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2} - p(x_i)\frac{u_{i+1} - u_{i-1}}{2h} - q(x_i)u_i = r(x_i),$
(b) $u_0 = \alpha, \quad u_{N+1} = \beta,$   (10.14)

for $i = 1, 2, \ldots, N$. Multiplying (10.14a) by $-\frac{h^2}{2}$ we obtain

$-\frac{h^2}{2}L_h\{u_i\} = -b_i u_{i-1} + a_i u_i - c_i u_{i+1} = -\frac{h^2}{2}r(x_i),$   (10.15)

where

$a_i = 1 + \frac{h^2}{2}q(x_i)$, $b_i = \frac{1}{2}\left[1 + \frac{h}{2}p(x_i)\right]$, and $c_i = \frac{1}{2}\left[1 - \frac{h}{2}p(x_i)\right]$.

We can write (10.15) with boundary conditions (10.14b) as a linear system

$Au = r$, where   (10.16)

$u = (u_1, u_2, \ldots, u_N)^T, \quad r = -\frac{h^2}{2}\,(r(x_1), r(x_2), \ldots, r(x_N))^T + (b_1\alpha, 0, \ldots, 0, c_N\beta)^T,$   (10.17)

and

$A = \begin{pmatrix} a_1 & -c_1 & 0 & \cdots & 0 \\ -b_2 & a_2 & -c_2 & & \vdots \\ 0 & -b_3 & a_3 & -c_3 & \\ \vdots & & \ddots & \ddots & \ddots \\ & & -b_{N-1} & a_{N-1} & -c_{N-1} \\ 0 & \cdots & 0 & -b_N & a_N \end{pmatrix}.$

To find $\{u_i\}_{i=1}^{N}$, we solve linear system (10.16) with tridiagonal coefficient
matrix $A$. If we require the mesh spacing $h$ to be small enough to ensure

$\frac{h}{2}|p(x_i)| \le 1$ for $i = 1, 2, \ldots, N,$   (10.18)

then, since $q(x) > 0$, $|a_i| = a_i$, $|b_i| = b_i$, and $|c_i| = c_i$. Furthermore, since

$b_i + c_i = 1 < a_i$

for each $i$,

$|a_i| > |b_i| + |c_i|$ for $i = 1, 2, \ldots, N$.

Hence, $A$ is strictly diagonally dominant,⁴ and the system (10.16) can be
solved using direct factorization methods, which are very efficient for
tridiagonal systems.
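A minimal sketch of this finite-difference method, using the coefficients $a_i$, $b_i$, $c_i$ defined above and tridiagonal elimination without pivoting (which strict diagonal dominance makes safe), might look as follows. The function name and the test problem $y'' - y = 0$, $y(0) = 0$, $y(1) = \sinh 1$ (that is, $p = 0$, $q = 1$, $r = 0$, with exact solution $\sinh x$) are assumptions made for this illustration.

```python
import math

def solve_bvp_fd(p, q, r, a, b, alpha, beta, N):
    """Finite-difference solution of y'' - p y' - q y = r, y(a)=alpha, y(b)=beta."""
    h = (b - a) / (N + 1)
    x = [a + i * h for i in range(N + 2)]
    # Row k corresponds to mesh index i = k + 1:
    #   lower[k]*u_{i-1} + diag[k]*u_i + upper[k]*u_{i+1} = rhs[k]
    diag = [1.0 + 0.5 * h * h * q(x[i]) for i in range(1, N + 1)]          # a_i
    lower = [-0.5 * (1.0 + 0.5 * h * p(x[i])) for i in range(1, N + 1)]    # -b_i
    upper = [-0.5 * (1.0 - 0.5 * h * p(x[i])) for i in range(1, N + 1)]    # -c_i
    rhs = [-0.5 * h * h * r(x[i]) for i in range(1, N + 1)]
    rhs[0] += 0.5 * (1.0 + 0.5 * h * p(x[1])) * alpha                      # + b_1 * alpha
    rhs[-1] += 0.5 * (1.0 - 0.5 * h * p(x[N])) * beta                      # + c_N * beta
    # Forward elimination (tridiagonal, no pivoting).
    for k in range(1, N):
        m = lower[k] / diag[k - 1]
        diag[k] -= m * upper[k - 1]
        rhs[k] -= m * rhs[k - 1]
    # Back substitution.
    u = [0.0] * N
    u[-1] = rhs[-1] / diag[-1]
    for k in range(N - 2, -1, -1):
        u[k] = (rhs[k] - upper[k] * u[k + 1]) / diag[k]
    return x, [alpha] + u + [beta]

N = 9
x, u = solve_bvp_fd(lambda t: 0.0, lambda t: 1.0, lambda t: 0.0,
                    0.0, 1.0, 0.0, math.sinh(1.0), N)
h = 1.0 / (N + 1)
max_err = max(abs(ui - math.sinh(xi)) for xi, ui in zip(x, u))
```

For this test problem the fourth derivative of the solution is bounded by $\sinh 1$ and $q \equiv 1$, so an $O(h^2)$ error no larger than $h^2 \sinh(1)/12$ is to be expected.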

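We now consider the error in approximating $\{y(x_i)\}_{i=0}^{N+1}$ by
$\{u_i\}_{i=0}^{N+1}$.


4 and thus, nonsingular


DEFINITION 10.1 We denote the local truncation error $\tau_i$ by

$\tau_i = L_h\{y(x_i)\} - L\{y(x_i)\}.$

Assuming $y \in C^4[a,b]$,

$\tau_i = \left[\frac{y(x_i + h) - 2y(x_i) + y(x_i - h)}{h^2} - y''(x_i)\right] - p(x_i)\left[\frac{y(x_i + h) - y(x_i - h)}{2h} - y'(x_i)\right]$
$\quad\ = \frac{h^2}{12}\left[y^{(4)}(\xi_i) - 2p(x_i)y^{(3)}(\eta_i)\right],$

where $\xi_i, \eta_i \in [x_{i-1}, x_{i+1}]$.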
We now have the following error estimates:

THEOREM 10.2
If the interval width $h$ satisfies (10.18), then

$|u_i - y(x_i)| \le h^2\left(\frac{M_4 + 2P^*M_3}{12Q_*}\right), \quad i = 0, 1, 2, \ldots, N+1,$   (10.19)

where $y(x)$ is the solution to (10.12), $\{u_i\}_{i=0}^{N+1}$ is the solution to
(10.14), and

$P^* = \max_{a \le x \le b}|p(x)|, \quad M_3 = \max_{a \le x \le b}|y^{(3)}(x)|,$
$M_4 = \max_{a \le x \le b}|y^{(4)}(x)|, \quad Q_* = \min_{a \le x \le b}|q(x)|.$   (10.20)


PROOF Define $e_i = u_i - y(x_i)$, $i = 0, 1, 2, \ldots, N+1$.
Consider

$0 = L_h\{u_i\} - L\{y(x_i)\}$
$\ \ = L_h\{u_i\} - L_h\{y(x_i)\} + L_h\{y(x_i)\} - L\{y(x_i)\}$
$\ \ = L_h\{u_i\} - L_h\{y(x_i)\} + \tau_i$
$\ \ = \frac{e_{i+1} - 2e_i + e_{i-1}}{h^2} - p(x_i)\frac{e_{i+1} - e_{i-1}}{2h} - q(x_i)e_i + \tau_i.$

Multiplying the above by $\frac{h^2}{2}$ and rearranging,

$a_i e_i = b_i e_{i-1} + c_i e_{i+1} + \frac{h^2}{2}\tau_i, \quad i = 1, 2, \ldots, N.$

Now let

$e = \max_{0 \le i \le N+1}|e_i|, \quad \tau = \max_{1 \le i \le N}|\tau_i|.$

Then

$|a_i||e_i| \le (|b_i| + |c_i|)e + \frac{h^2}{2}\tau = e + \frac{h^2}{2}\tau,$

since $|b_i| + |c_i| = 1$. But $|a_i| = a_i \ge 1 + \frac{h^2}{2}Q_*$, so the preceding
implies that

$\left(1 + \frac{h^2}{2}Q_*\right)|e_i| \le e + \frac{h^2}{2}\tau, \quad i = 1, 2, \ldots, N.$

Also, for $i = 0$ and $i = N+1$, $e_0 = e_{N+1} = 0$, so the preceding also holds for
$i = 0$ and $i = N+1$. Thus, replacing $|e_i|$ by $e$ on the left-hand side of the
inequality, we obtain

$\left(1 + \frac{h^2}{2}Q_*\right)e \le e + \frac{h^2}{2}\tau.$

Hence,

$e \le \frac{\tau}{Q_*} \le \frac{1}{Q_*}\,\frac{h^2}{12}\left[M_4 + 2P^*M_3\right].$

Derivative Values as Boundary Conditions Let's consider how other boundary conditions can be handled. First, suppose that $y'(0) = \alpha$ and $y'(1) = \beta$ in place of $y(0) = \alpha$ and $y(1) = \beta$. In this case, $u_0$ and $u_{N+1}$ are additional unknowns, and the system has $N+2$ unknowns rather than $N$ unknowns. Consider
\[ y''(x) \approx \frac{y(x + \Delta x) - 2y(x) + y(x - \Delta x)}{(\Delta x)^2}, \]
which is approximated by
\[ \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2}. \]
At $i = 0$,
\[ y''(0) \approx \frac{u_1 - 2u_0 + u_{-1}}{h^2}, \]
and at $i = N+1$,
\[ y''(1) \approx \frac{u_{N+2} - 2u_{N+1} + u_N}{h^2}. \]
Thus, we need a value for $u_{-1}$. But
\[ y'(0) = \alpha \approx \frac{1}{2h}(u_1 - u_{-1}), \]
so we may take $u_{-1} = u_1 - 2\alpha h$. Similarly, we take $u_{N+2} = u_N + 2\beta h$. Thus,
\[ y''(0) \approx \frac{2u_1 - 2u_0 - 2\alpha h}{h^2}, \qquad y''(1) \approx \frac{2u_N - 2u_{N+1} + 2\beta h}{h^2}, \tag{10.21} \]
and (10.14a) now provides $N+2$ equations, $i = 0,1,\ldots,N+1$, for the $N+2$ unknowns. (Unfortunately, these approximations to $y''(0)$ and $y''(1)$ are only first-order accurate except in the special case $\alpha = \beta = 0$, when they are second-order accurate, as you will show in Exercise 4.)

Second, consider the mixed boundary conditions 


\[ \eta_1 y(0) + \eta_2 y'(0) = \alpha, \qquad \gamma_1 y(1) + \gamma_2 y'(1) = \beta. \tag{10.22} \]


Approximations analogous to those discussed previously can also be used in this case. For example, we may take
\[ \eta_1 y(0) + \eta_2 y'(0) = \alpha \approx \eta_1 u_0 + \eta_2\,\frac{1}{2h}(u_1 - u_{-1}) \]
to obtain
\[ u_{-1} = u_1 - \frac{2h(\alpha - \eta_1 u_0)}{\eta_2}. \]
10.1.3. Galerkin Methods 


In this section, we consider form (10.2) of the BVP (10.1) and write it as
\[ -\frac{d}{dx}\left(r(x)\frac{dy}{dx}\right) + s(x)y(x) = f(x), \quad 0 < x < 1, \qquad y(0) = y(1) = 0. \tag{10.23} \]
Suppose that $0 < r_{\min} \le r(x) \le r_{\max}$ and $0 \le s(x) \le s_{\max}$ for $0 \le x \le 1$, for some constants $r_{\min}$, $r_{\max}$, and $s_{\max}$. By Theorem 10.1, when $f, s \in C[0,1]$ and $r \in C^1[0,1]$, the BVP (10.23) has a unique solution $y \in C^2[0,1]$. Now let $\varphi \in C[0,1]$ such that $\varphi(0) = \varphi(1) = 0$ and $\varphi'(x)$ is piecewise continuous on $[0,1]$. Otherwise, $\varphi$ is arbitrary. Thus,
\[ \varphi \in S = \{u \in C[0,1] : u(0) = u(1) = 0 \text{ and } u'(x) \text{ is piecewise continuous}\}. \]


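Multiplying (10.23) by $\varphi$ and integrating both sides from 0 to 1, we obtain
\[ -\int_0^1 \frac{d}{dx}\left(r(x)\frac{dy}{dx}\right)\varphi(x)\,dx + \int_0^1 s(x)y(x)\varphi(x)\,dx = \int_0^1 f(x)\varphi(x)\,dx. \tag{10.24} \]
Integrating the first expression by parts and using $\varphi(0) = \varphi(1) = 0$, we obtain
\[ \int_0^1 r(x)y'(x)\varphi'(x)\,dx + \int_0^1 s(x)y(x)\varphi(x)\,dx = \int_0^1 f(x)\varphi(x)\,dx \quad \text{for all } \varphi \in S. \tag{10.25} \]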


Equation (10.25) is called the weak formulation of (10.23). It is nearly equivalent to (10.23) in the sense that if $y$ satisfies (10.25), $y(0) = y(1) = 0$, and $y \in C^2[0,1]$, then $y$ satisfies (10.23). (This is because the steps leading to (10.25) can be reversed.)

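The Galerkin Method finds an approximation to the solution of the weak formulation (10.25) of the form
\[ Y(x) = \sum_{i=1}^{M} a_i\varphi_i(x), \]
where the $\{\varphi_i(x)\}_{i=1}^{M}$ are linearly independent functions that vanish at $x = 0$ and $x = 1$, i.e., $\varphi_i(0) = \varphi_i(1) = 0$ for $i = 1,2,\ldots,M$. Let
\[ S_M = \mathrm{span}(\varphi_1, \varphi_2, \ldots, \varphi_M) \subset S, \]
and suppose that $Y \in S_M$ satisfies
\[ \int_0^1 r(x)Y'(x)\Phi'(x)\,dx + \int_0^1 s(x)Y(x)\Phi(x)\,dx = \int_0^1 f(x)\Phi(x)\,dx \tag{10.26} \]
for all $\Phi \in S_M$. We now show that (10.26) defines $Y(x)$ uniquely. Letting $\Phi = \varphi_k$ for $k = 1,2,\ldots,M$ gives
\[ \int_0^1 r(x)\frac{dY}{dx}\frac{d\varphi_k}{dx}\,dx + \int_0^1 s(x)Y(x)\varphi_k(x)\,dx = \int_0^1 f(x)\varphi_k(x)\,dx \tag{10.27} \]
for $k = 1,2,\ldots,M$. But $Y(x) = \sum_{i=1}^{M} a_i\varphi_i(x)$, so
\[ \sum_{i=1}^{M} a_i\left[\int_0^1 r(x)\varphi_i'(x)\varphi_k'(x)\,dx + \int_0^1 s(x)\varphi_i(x)\varphi_k(x)\,dx\right] = \int_0^1 f(x)\varphi_k(x)\,dx, \quad k = 1,2,\ldots,M. \tag{10.28} \]
This is the linear system
\[ Aa = b \tag{10.29} \]
with $a = (a_1, a_2, \ldots, a_M)^T$,
\[ A_{ki} = \int_0^1 \left[r(x)\varphi_i'(x)\varphi_k'(x) + s(x)\varphi_i(x)\varphi_k(x)\right]dx, \qquad b_k = \int_0^1 f(x)\varphi_k(x)\,dx. \]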

REMARK 10.5 $A$ is symmetric positive definite and hence nonsingular, so we have a unique solution $a$ and $Y(x)$ is uniquely determined. Clearly, $A$ is symmetric. To show that $A$ is positive definite, consider
\[
\begin{aligned}
v^T A v &= \sum_{i,k}\left(v_i\int_0^1 r(x)\varphi_i'(x)\varphi_k'(x)\,dx\,v_k + v_i\int_0^1 s(x)\varphi_i(x)\varphi_k(x)\,dx\,v_k\right) \\
&= \int_0^1 r(x)\,(v'(x))^2\,dx + \int_0^1 s(x)\,v(x)^2\,dx \\
&> 0 \quad \text{if } v \ne 0,
\end{aligned}
\]
since $r(x) > 0$ and $s(x) \ge 0$, where $v(x) = \sum_{i=1}^{M} v_i\varphi_i(x)$. (Note that
\[ v'(x) = \left(\sum_{i=1}^{M} v_i\varphi_i(x)\right)' = 0 \]
only if $\sum_{i=1}^{M} v_i\varphi_i(x) = c$, which implies $c = 0$, since $\varphi_i(0) = \varphi_i(1) = 0$. Then, by linear independence, $v_i = 0$ for $i = 1,\ldots,M$.)


Before continuing, let's consider an example.

Example 10.2
Let $S_M$ be the set of continuous piecewise linear functions defined on the partition $x_0 = 0, x_1 = h, \ldots, x_M = 1$, where $h = 1/M$ and $x_i = ih$, $i = 0,1,\ldots,M$. Note that
\[ S_M = \mathrm{span}\{\varphi_1(x), \varphi_2(x), \ldots, \varphi_{M-1}(x)\}, \]
where
\[
\varphi_i(x) =
\begin{cases}
\dfrac{x - x_{i-1}}{h}, & x_{i-1} \le x \le x_i, \\[2mm]
\dfrac{x_{i+1} - x}{h}, & x_i \le x \le x_{i+1}, \\[2mm]
0, & \text{otherwise}.
\end{cases}
\]
In this case, $A$ is tridiagonal as well as positive definite. If $r(x) = 1 + x$, $s(x) = 0$, $f(x) = 100$, then the solution is
\[ y(x) = -100x + 100\log(1 + x)/\log 2. \]
Solving (10.29) for $a$ and obtaining $Y(x)$, the following $L^2$-errors were obtained for various values of $M$.

 M  | $\|y - Y\|_2 = \left(\int_0^1 (y(x) - Y(x))^2\,dx\right)^{1/2}$
 2  | 1.780
 4  | 0.464
 8  | 0.118
 16 | 0.029

Notice that the error appears to be proportional to $1/M^2 = h^2$.
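The computation in Example 10.2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used to produce the table: the function names `galerkin_hat`, `exact`, and `l2_error` are hypothetical, the stiffness entries are worked out by hand for $r(x) = 1 + x$ and $s(x) = 0$ (giving $A_{ii} = 2(1 + x_i)/h$, $A_{i,i+1} = -(1 + x_i + h/2)/h$, and $b_i = 100h$), and the $L^2$ error is approximated with a 3-point Gauss rule on each subinterval.

```python
import math

def galerkin_hat(M):
    """Galerkin solution of -((1+x) y')' = 100, y(0)=y(1)=0, with hat functions.

    Returns the nodal values Y(x_0), ..., Y(x_M) on the grid x_i = i/M.
    """
    h = 1.0 / M
    n = M - 1  # number of interior nodes (basis functions)
    diag = [2.0 * (1.0 + (i + 1) * h) / h for i in range(n)]
    # off[i] couples interior nodes i and i+1 (0-based), i.e. x_{i+1} and x_{i+2}
    off = [-(1.0 + (i + 1) * h + 0.5 * h) / h for i in range(n - 1)]
    rhs = [100.0 * h] * n
    # Elimination for a symmetric tridiagonal system
    for k in range(1, n):
        m = off[k - 1] / diag[k - 1]
        diag[k] -= m * off[k - 1]
        rhs[k] -= m * rhs[k - 1]
    a = [0.0] * n
    a[-1] = rhs[-1] / diag[-1]
    for k in range(n - 2, -1, -1):
        a[k] = (rhs[k] - off[k] * a[k + 1]) / diag[k]
    return [0.0] + a + [0.0]

def exact(x):
    """Exact solution stated in Example 10.2."""
    return -100.0 * x + 100.0 * math.log(1.0 + x) / math.log(2.0)

def l2_error(M):
    """Approximate L2 norm of y - Y using 3-point Gauss quadrature per element."""
    Y = galerkin_hat(M)
    h = 1.0 / M
    g = math.sqrt(0.6)
    nodes = (-g, 0.0, g)
    weights = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)
    total = 0.0
    for i in range(M):
        x0 = i * h
        for z, w in zip(nodes, weights):
            x = x0 + 0.5 * h * (z + 1.0)
            Yx = Y[i] + (Y[i + 1] - Y[i]) * (x - x0) / h  # linear interpolant
            total += 0.5 * h * w * (exact(x) - Yx) ** 2
    return math.sqrt(total)
```

Calling `l2_error(M)` for $M = 2, 4, 8, 16$ should give values that shrink by about a factor of four each time $M$ doubles, which is the $h^2$ behavior noted in the example.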


The basis functions in Example 10.2 are commonly called “hat functions” 
or “chapeau functions.” 


REMARK 10.6 If the basis functions have small support, i.e., are nonzero on small regions, as in the above example, then the numerical method is called a finite element method. The matrix $A$ is then sparse (in the above example, tridiagonal). The finite element method has become a popular and important method for solving partial differential equations. However, we are not restricted to using such basis functions. We could choose, for example, trigonometric functions, such as
\[ S_M = \mathrm{span}(\sin \pi x, \sin 2\pi x, \ldots, \sin M\pi x). \]
We could also choose polynomials, and a reasonable choice of $S_M$ could then be
\[ S_M = \mathrm{span}\{x(1 - x), x^2(1 - x), \ldots, x^M(1 - x)\}. \]
(Recall that $\varphi_k(0) = \varphi_k(1) = 0$ for each $k$.)


We now obtain a bound on the error. We introduce an "energy" functional. (This is related to the energy of a physical system for certain problems.) Let
\[ F(u) = \int_0^1 \left(\frac{1}{2}r(x)(u'(x))^2 + \frac{1}{2}s(x)u^2(x) - f(x)u(x)\right)dx. \tag{10.30} \]
Let $y$ be the solution of (10.23) and $w$ be an element of $S$. Let $e = w - y$. Then
\[
\begin{aligned}
F(w) = F(y + e) = \int_0^1 \Big[ &\frac{1}{2}r(x)\left((e'(x))^2 + 2e'(x)y'(x) + (y'(x))^2\right) \\
&+ \frac{1}{2}s(x)\left(e^2(x) + 2e(x)y(x) + y^2(x)\right) - f(x)\left(e(x) + y(x)\right)\Big]\,dx.
\end{aligned}
\]
Since $y$ satisfies the weak formulation, several terms sum to zero, and we obtain
\[ F(w) = F(y) + \|e\|_F^2, \tag{10.31} \]
where $\|\cdot\|_F$ is the "energy" norm defined by
\[ \|e\|_F^2 = \int_0^1 \left[\frac{1}{2}r(x)(e'(x))^2 + \frac{1}{2}e^2(x)s(x)\right]dx. \tag{10.32} \]
This is a legitimate norm since $\|e\|_F^2 = 0$ only if $e = 0$. Similarly, we can show for all $W \in S_M$ that
\[ F(W) = F(Y) + \|E\|_F^2, \tag{10.33} \]
where $E = W - Y$.

Equations (10.31) and (10.33) tell us that $y$ minimizes $F$ over all $u \in S$ and $Y$ minimizes $F$ over all $U \in S_M$. That is, $F(y) \le F(w)$ for all $w \in S$ and $F(Y) \le F(W)$ for all $W \in S_M$. Hence, $F(Y) - F(y) \le F(W) - F(y)$ for all $W \in S_M$.

Now, letting $w$ in (10.31) be $Y$ or $W$, we obtain
\[ F(W) = F(y) + \|W - y\|_F^2 \]
and
\[ F(Y) = F(y) + \|Y - y\|_F^2. \]
Hence,
\[ \|Y - y\|_F^2 \le \|W - y\|_F^2 \quad \text{for all } W \in S_M. \tag{10.34} \]
Thus, $Y$ is the best approximation in $S_M$ to $y$ in the norm $\|\cdot\|_F$; that is, among all functions in $S_M$, $Y$ is closest to the solution $y$ in the energy norm.


REMARK 10.7 With a considerable amount of additional work, we can show that if $y \in C^2[0,1]$, then the error in the $L^2$-norm is proportional to $h^2$, i.e., $\|Y - y\|_2 \le c_0 h^2$, where $\|g\|_2 = \left(\int_0^1 g^2(x)\,dx\right)^{1/2}$.

REMARK 10.8 Galerkin methods can alternately be described in the following manner. Let $Ly = f$, $x \in \Omega$, for example, be a boundary-value problem with $By = 0$ for $x \in \partial\Omega$, where $\partial\Omega$ denotes the boundary⁵ of $\Omega$. Let $S_N$ be a finite-dimensional subspace of a Hilbert space $S$, where $y \in S$. Then $LU_N - f = e$, where $U_N = \sum_{i=1}^{N} c_i\varphi_i$ and $S_N = \mathrm{span}(\varphi_1, \varphi_2, \ldots, \varphi_N)$. To find $U_N$, we make $e$ orthogonal to $\varphi_k$, $k = 1,2,\ldots,N$. That is, we find the $c_i$ to make
\[ \left(L\sum_{i=1}^{N} c_i\varphi_i,\ \varphi_k\right) - (f, \varphi_k) = 0 \quad \text{for } k = 1,2,\ldots,N. \]
Notice that this approach results in system (10.27).


Let's again consider the Galerkin method but in a functional analysis setting. We define the following Hilbert spaces:

DEFINITION 10.2
\[ H^1(0,1) = \left\{u \in L^2(0,1) : \frac{du}{dx} \in L^2(0,1)\right\}, \]
\[ H_0^1(0,1) = \{u \in H^1(0,1) : u(0) = u(1) = 0\}, \]
where $L^2(0,1)$ is the set of those Lebesgue measurable functions $u(x)$ on $(0,1)$ that satisfy $\int_0^1 u^2(x)\,dx < \infty$. With the above sets, we define the inner products:

(i) On $L^2(0,1)$:
\[ (v, w) = \int_0^1 v(x)w(x)\,dx, \qquad \|v\|_0^2 = (v, v). \]
(ii) On $H^1(0,1)$ or $H_0^1(0,1)$:
\[ (v, w)_1 = \int_0^1 v(x)w(x)\,dx + \int_0^1 v'(x)w'(x)\,dx, \]
\[ \|v\|_1^2 = \int_0^1 v^2(x)\,dx + \int_0^1 (v'(x))^2\,dx = (v, v)_1. \]

REMARK 10.9 With the above inner products, it is possible to show that $H^1(0,1)$ and $H_0^1(0,1)$ are complete inner product spaces, i.e., Hilbert spaces.


⁵In the boundary value problems we have been considering, $\Omega = [0,1]$, and $\partial\Omega$ consists of the points 0 and 1.

REMARK 10.10 Another way to define these spaces is the following. Let
\[ V^m = \left\{u \in C^m(0,1) : \|u\|_m^2 = \sum_{k=0}^{m}\int_0^1 \left(u^{(k)}(x)\right)^2 dx < \infty\right\}. \]
Then define $L^2(0,1)$ = completion of $V^0$ with respect to the norm $\|\cdot\|_0$ and $H^1(0,1)$ = completion of $V^1$ with respect to the norm $\|\cdot\|_1$. (Hence, $V^0$ is dense in $L^2(0,1)$ and $V^1$ is dense in $H^1(0,1)$.)

On $H_0^1(0,1)$, we define the bilinear form $B(\cdot,\cdot)$ by
\[ B(v, w) = \int_0^1 r(x)v'(x)w'(x)\,dx + \int_0^1 s(x)v(x)w(x)\,dx = (rv', w') + (sv, w) \quad \text{for } v, w \in H_0^1(0,1). \tag{10.35} \]
Now note that if $y$ is the classical solution of (10.23), then
\[
\begin{aligned}
B(y, v) &= \int_0^1 ry'v'\,dx + \int_0^1 syv\,dx \\
&= ry'v\Big|_0^1 - \int_0^1 (ry')'v\,dx + \int_0^1 syv\,dx \\
&= \int_0^1 \left(-(ry')' + sy\right)v\,dx \\
&= \int_0^1 fv\,dx \\
&= (f, v).
\end{aligned}
\]
That is, $y$ satisfies
\[ B(y, v) = (f, v) \quad \text{for every } v \in H_0^1(0,1). \tag{10.36} \]


Now suppose that (10.36) is to be satisfied for every $v \in H_0^1(0,1)$. Our question is: can we find a function $y \in H_0^1(0,1)$ which is the unique solution of (10.36)? If such is the case, we call $y$ the generalized solution of (10.36) in $H_0^1(0,1)$. Furthermore, since $C_0^2[0,1] \subset H_0^1(0,1)$, if the solution to (10.36) is sufficiently smooth, then the solution will also be the classical solution. To prove existence and uniqueness of the solution $y$ to (10.36), we will use the Lax-Milgram Lemma:


THEOREM 10.3
(Lax-Milgram Lemma) Let $H$ be a (real) Hilbert space and let $B(\cdot,\cdot) : H \times H \to \mathbb{R}^1$ be a bilinear form on $H$ which satisfies

(a) $|B(\Phi, \Psi)| \le c_1\|\Phi\|\,\|\Psi\|$ for all $\Phi, \Psi \in H$ (boundedness),
(b) $B(\Phi, \Phi) \ge c_2\|\Phi\|^2$ for all $\Phi \in H$ (coerciveness),

where $c_1$ and $c_2$ are positive constants independent of $\Phi, \Psi \in H$. Let $F : H \to \mathbb{R}^1$ be a given (real-valued) bounded linear functional on $H$. Then there exists a unique $u \in H$ satisfying
\[ B(u, v) = F(v) \quad \text{for all } v \in H. \tag{10.37} \]
Moreover, $\|u\| \le \|F\|/c_2$, where
\[ \|F\| = \sup_{f \in H,\ f \ne 0} \frac{|F(f)|}{\|f\|}. \]


Thus, to prove existence and uniqueness of $y$ that satisfies (10.36), we will show that the conditions of the Lax-Milgram Lemma are satisfied. First, let $F(v) = (f, v)$. Then, $F$ is a bounded linear functional on $H_0^1(0,1)$, since
\[ \|F\| = \sup_{v \in H_0^1,\ v \ne 0} \frac{|(f, v)|}{\|v\|_1} \le \sup_{v \in H_0^1,\ v \ne 0} \frac{\|f\|_0\|v\|_0}{\|v\|_1} \le \|f\|_0. \]
Now consider $B(\cdot,\cdot)$ defined in (10.35). Clearly, $B(\cdot,\cdot)$ is a bilinear form on $H_0^1 \times H_0^1$. Furthermore, for $v, w \in H_0^1$ we have
\[
\begin{aligned}
|B(w, v)| &\le r_{\max}\int_0^1 |w'|\,|v'|\,dx + s_{\max}\int_0^1 |w|\,|v|\,dx \\
&\le r_{\max}\|w'\|_0\|v'\|_0 + s_{\max}\|w\|_0\|v\|_0
\end{aligned}
\]
by the Cauchy-Schwarz inequality. Thus, $|B(w, v)| \le (r_{\max} + s_{\max})\|w\|_1\|v\|_1$, so condition (a) of Theorem 10.3 is satisfied. Now, for $v \in H_0^1(0,1)$,
\[
\begin{aligned}
B(v, v) &= \int_0^1 r(v')^2\,dx + \int_0^1 sv^2\,dx \\
&\ge r_{\min}\|v'\|_0^2 = r_{\min}\left(\frac{1}{2}\|v'\|_0^2 + \frac{1}{2}\|v'\|_0^2\right).
\end{aligned}
\]
Then, since $v \in H_0^1(0,1)$, Poincaré's inequality yields
\[ B(v, v) \ge \frac{1}{2}r_{\min}\left(\|v'\|_0^2 + \frac{1}{c}\|v\|_0^2\right) \]
for some constant $c$. Thus $B(v, v) \ge c_2\|v\|_1^2$, where $c_2 = \frac{1}{2}r_{\min}\min(1, \frac{1}{c})$, and condition (b) of the Lax-Milgram Lemma is satisfied. (Poincaré's inequality is presented and proven later as Theorem 10.5 on page 553.) Therefore, by the Lax-Milgram Lemma, there exists a unique $y \in H_0^1(0,1)$ satisfying (10.36). Moreover, $\|y\|_1 \le c\|f\|_0$ for some constant $c$.

We now approximate $y$, the solution of (10.36), by its Galerkin approximation $y_h$ from a finite-dimensional subspace $S_h$ of $H_0^1(0,1)$. The approximate solution $y_h$ is required to satisfy
\[ B(y_h, v_h) = (f, v_h) \quad \text{for all } v_h \in S_h. \tag{10.38} \]
The existence and uniqueness of $y_h \in S_h$ is guaranteed by the Lax-Milgram Lemma, since $S_h$ is also a Hilbert space. Now consider the error in the approximation. We have the following result.


THEOREM 10.4
Let $y$ satisfy (10.36) and let $y_h$ satisfy (10.38). Then
\[ \|y - y_h\|_1 \le \left(1 + \frac{c_1}{c_2}\right)\inf_{\chi \in S_h}\|y - \chi\|_1, \]
where $c_1 = r_{\max} + s_{\max}$ and $c_2 = \frac{1}{2}r_{\min}\min(1, \frac{1}{c})$.

PROOF Let $\chi \in S_h$. Then
\[ \|y - y_h\|_1 \le \|y - \chi\|_1 + \|\chi - y_h\|_1. \tag{10.39} \]
In addition,
\[
\begin{aligned}
c_2\|\chi - y_h\|_1^2 &\le B(\chi - y_h, \chi - y_h) = B(\chi - y + y - y_h, \chi - y_h) \\
&= B(\chi - y, \chi - y_h) + B(y - y_h, \chi - y_h).
\end{aligned} \tag{10.40}
\]
Now, since $B(y_h, \Psi) = F(\Psi) = B(y, \Psi)$ for all $\Psi \in S_h$, we have
\[ B(y - y_h, \Psi) = 0 \quad \text{for all } \Psi \in S_h. \]
In particular, for $\Psi = \chi - y_h$, $B(y - y_h, \chi - y_h) = 0$, so (10.40) implies
\[ c_2\|\chi - y_h\|_1^2 \le B(\chi - y, \chi - y_h) \le c_1\|\chi - y\|_1\|\chi - y_h\|_1. \tag{10.41} \]
Therefore,
\[ \|\chi - y_h\|_1 \le \frac{c_1}{c_2}\|\chi - y\|_1. \]
Then, (10.39) implies
\[ \|y - y_h\|_1 \le \left(1 + \frac{c_1}{c_2}\right)\|\chi - y\|_1 \]
for any $\chi \in S_h$, i.e.,
\[ \|y - y_h\|_1 \le \left(1 + \frac{c_1}{c_2}\right)\inf_{\chi \in S_h}\|\chi - y\|_1. \]

COROLLARY 10.1
If the family $S_h$ of subspaces satisfies $\lim_{h \to 0}\inf_{\chi \in S_h}\|y - \chi\|_1 = 0$, then $\lim_{h \to 0}\|y - y_h\|_1 = 0$.
Example 10.3
It can be shown that if $S_h$ is the space of continuous piecewise linear approximations considered in Example 10.2 (the "hat function" example on page 546) and $y \in H^2(0,1) \cap H_0^1(0,1)$, then
\[ \|y - y_h\|_1 \le c_1 h\|y\|_2 \quad \text{and} \quad \|y - y_h\|_0 \le c_2 h^2\|y\|_2, \]
where
\[ \|w\|_0^2 = \int_0^1 w^2(x)\,dx, \qquad \|w\|_1^2 = \int_0^1 \left(w^2(x) + (w'(x))^2\right)dx, \]
and
\[ \|w\|_2^2 = \int_0^1 \left[w^2(x) + (w'(x))^2 + (w''(x))^2\right]dx. \]

THEOREM 10.5
(Poincaré's inequality in one dimension) If $f \in H_0^1(0,1)$, then $\|f\|_0 \le \|f'\|_0$.

Proof: Since $f \in H_0^1(0,1)$, $f(x) = \int_0^x f'(s)\,ds$. Using the Cauchy-Schwarz inequality in $L^2(0,1)$,
\[
\begin{aligned}
|f(x)| &\le \int_0^x |f'(s)|\,ds \\
&\le \left(\int_0^x 1^2\,ds\right)^{1/2}\left(\int_0^x |f'(s)|^2\,ds\right)^{1/2} \\
&\le \left(\int_0^1 |f'(s)|^2\,ds\right)^{1/2} = \|f'\|_0.
\end{aligned}
\]
Thus,
\[ \left(\int_0^1 |f(x)|^2\,dx\right)^{1/2} \le \left(\int_0^1 \|f'\|_0^2\,dx\right)^{1/2} = \|f'\|_0. \]

REMARK 10.11 In practice, Corollary 10.1 indicates that by reducing the step size $h$, the finite element approximation in the subspace $S_h$ can be made as close as desired to the exact solution. In other words, decreasing the step size $h$ (or refining the mesh) yields a better approximation. This corresponds to the $h$ version of the finite element method [92]. Alternatively, one may enrich the subspace $S_h$ by employing a higher-order polynomial degree approximation instead of mesh refinement. This corresponds to the $p$ version of the finite element method [92, 81]. In the last several years, there has been a significant amount of research in employing both mesh refinement and polynomial degree refinement simultaneously to obtain the approximate finite element solution (which is called the $hp$-version) with numerous applications in a variety of areas [81, 82].


10.2. Approximation of Integral Equations 


Integral equations occasionally arise in applications.⁶ In this section, we consider numerical methods for solving Volterra and Fredholm integral equations of the second kind. A Volterra equation of the second kind has the form
\[ f(t) - \int_0^t K(t, s, f(s))\,ds = g(t), \quad 0 \le t \le T, \tag{10.42} \]
and a Fredholm integral equation of the second kind has the form
\[ f(t) - \int_0^T K(t, s, f(s))\,ds = g(t), \quad 0 \le t \le T. \tag{10.43} \]
Several references on the numerical treatment of integral equations are [7], [20], [53], and [67].


10.2.1 Volterra Integral Equations of Second Kind 


In this section, we study nonlinear Volterra equations of the second kind 


\[ f(t) = g(t) + \int_0^t K(t, s, f(s))\,ds, \quad 0 \le t \le T. \tag{10.44} \]


We will assume that the kernel K(t,s,u) satisfies a Lipschitz condition with 
respect to the third argument. 


10.2.1.1 Existence and Uniqueness of Solutions 


Recall: 


⁶In fact, many problems that can be posed as differential equations can also be posed as equivalent integral equations.

DEFINITION 10.3 $K(t, s, u)$ satisfies a Lipschitz condition with respect to the third argument if there is a constant $L > 0$ such that
\[ |K(t, s, y) - K(t, s, z)| \le L|y - z| \]
for $t, s \in [0, T]$, where $L$ is independent of $t$, $s$, $y$, and $z$.

THEOREM 10.6
Assume that the functions $g(t)$ and $K(t, s, u)$ are continuous in $0 \le s \le t \le T$ and $-\infty < u < \infty$ and the kernel satisfies a Lipschitz condition of the form given in Definition 10.3. Then (10.44) has a unique continuous solution for all finite $T$.


PROOF We define successive iterates by
\[ f_n(t) = g(t) + \int_0^t K(t, s, f_{n-1}(s))\,ds \tag{10.45} \]
for $n = 1,2,3,\ldots$, with $f_0(t) = g(t)$. We subtract from (10.45) a similar equation with $n$ replaced by $n - 1$. Then
\[ f_n(t) - f_{n-1}(t) = \int_0^t \{K(t, s, f_{n-1}(s)) - K(t, s, f_{n-2}(s))\}\,ds. \tag{10.46} \]
Let $\varphi_n(t) = f_n(t) - f_{n-1}(t)$ with $\varphi_0(t) = g(t)$. We see that
\[ f_n(t) = \sum_{i=0}^{n}\varphi_i(t). \tag{10.47} \]
Also, (10.46) and the Lipschitz condition give
\[ |\varphi_n(t)| \le L\int_0^t |\varphi_{n-1}(s)|\,ds. \tag{10.48} \]
We now show by induction that
\[ |\varphi_n(t)| \le \frac{G(Lt)^n}{n!}, \quad 0 \le t \le T, \quad n = 0,1,\ldots, \tag{10.49} \]
where $G = \max_{0 \le t \le T}|g(t)|$. Clearly, this is true for $n = 0$. Suppose it is true for $n - 1$. Then, by (10.48),
\[ |\varphi_n(t)| \le L\int_0^t \frac{G(Ls)^{n-1}}{(n-1)!}\,ds \le \frac{G(Lt)^n}{n!}. \]
This bound shows that the sequence $f_n(t)$ in (10.47) converges, and we write
\[ f(t) = \sum_{i=0}^{\infty}\varphi_i(t). \tag{10.50} \]
We now show that this $f(t)$ satisfies (10.44). The series (10.50) is uniformly convergent, since the terms $\varphi_i(t)$ are dominated by $\frac{G(LT)^i}{i!}$. Hence, $f(t)$ exists and is continuous. To prove that $f(t)$ defined by (10.50) satisfies the original equation (10.44), set
\[ f(t) = f_n(t) + \Delta_n(t). \tag{10.51} \]
From equation (10.45),
\[ f(t) - \Delta_n(t) = g(t) + \int_0^t K(t, s, f(s) - \Delta_{n-1}(s))\,ds, \]
so
\[ f(t) - g(t) - \int_0^t K(t, s, f(s))\,ds = \Delta_n(t) + \int_0^t \left[K(t, s, f(s) - \Delta_{n-1}(s)) - K(t, s, f(s))\right]ds. \]
Applying the Lipschitz condition gives
\[ \left|f(t) - g(t) - \int_0^t K(t, s, f(s))\,ds\right| \le |\Delta_n(t)| + Lt\|\Delta_{n-1}\|, \tag{10.52} \]
where $\|\Delta_{n-1}\| = \max_{0 \le s \le t}|\Delta_{n-1}(s)|$. But $\lim_{n \to \infty}|\Delta_n(t)| = 0$, so by taking $n$ large enough, the right member of (10.52) can be made as small as desired. It follows that the function defined by (10.50) satisfies
\[ f(t) = g(t) + \int_0^t K(t, s, f(s))\,ds. \]
To show uniqueness, we assume existence of another continuous solution $\tilde{f}(t)$. Then,
\[ |f(t) - \tilde{f}(t)| = \left|\int_0^t \{K(t, s, f(s)) - K(t, s, \tilde{f}(s))\}\,ds\right| \tag{10.53} \]
\[ \le L\int_0^t |f(s) - \tilde{f}(s)|\,ds \le BLt, \tag{10.54} \]
since $|f(s) - \tilde{f}(s)|$ must be bounded by some constant $B$. Thus, $|f(t) - \tilde{f}(t)| \le BLt$. By replacing $|f(s) - \tilde{f}(s)|$ by $BLs$ in (10.54), we obtain that $|f(t) - \tilde{f}(t)|$ must be bounded by $B(Lt)^2/2$. Repeating this process $n$ times leads us to
\[ |f(t) - \tilde{f}(t)| \le B\,\frac{(Lt)^n}{n!} \]
for any $n$. We therefore conclude that $f(t) = \tilde{f}(t)$.
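The successive iterates (10.45) used in the proof can also be computed approximately. The Python sketch below is an illustration only: the function name `picard_volterra` is hypothetical, and replacing the integral by the composite trapezoidal rule on a uniform grid is an added discretization that is not part of the proof. The test equation $f(t) = 1 + \int_0^t f(s)\,ds$, with solution $f(t) = e^t$, is also an assumption chosen because its exact iterates are the Taylor partial sums of $e^t$.

```python
import math

def picard_volterra(g, K, T, n_grid, n_iter):
    """Iterate f_n(t) = g(t) + int_0^t K(t, s, f_{n-1}(s)) ds on a uniform grid.

    The integral is approximated by the composite trapezoidal rule.
    Returns the grid and the values of the last iterate on it.
    """
    h = T / n_grid
    t = [i * h for i in range(n_grid + 1)]
    f = [g(ti) for ti in t]  # f_0 = g
    for _ in range(n_iter):
        new = []
        for i, ti in enumerate(t):
            integral = 0.0
            for j in range(i):  # trapezoid on [t_j, t_{j+1}]
                integral += 0.5 * h * (K(ti, t[j], f[j]) + K(ti, t[j + 1], f[j + 1]))
            new.append(g(ti) + integral)
        f = new
    return t, f

# Example: f(t) = 1 + int_0^t f(s) ds on [0, 1]
t_grid, f_vals = picard_volterra(lambda t: 1.0, lambda t, s, u: u, 1.0, 100, 20)
```

After 20 iterations the value `f_vals[-1]` should be close to $e$; the remaining discrepancy comes mainly from the trapezoidal rule rather than from the iteration, in line with the factorial decay in (10.49).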


REMARK 10.12 If $\left|\frac{\partial K(t, s, u)}{\partial u}\right| \le L$ for $0 \le s, t \le T$ and $-\infty < u < \infty$, then $K$ satisfies a Lipschitz condition of the form given in Definition 10.3.

10.2.1.2 Numerical Solution Techniques

We now consider numerical solution of the Volterra equation of the second kind (10.44). We assume we have a numerical integration rule of the form
\[ \int_0^{nh} \varphi(t)\,dt \approx h\sum_{i=0}^{n} w_{ni}\,\varphi(t_i), \tag{10.55} \]
where $\varphi(t)$ is any continuous integrand and $w_{ni}$ are integration weights. (For example, if $t_i = ih$, $w_{n0} = w_{nn} = \frac{1}{2}$, and $w_{ni} = 1$ for $i = 1,2,\ldots,n-1$, Equation (10.55) is the composite trapezoidal rule.) Using (10.55) to replace the integral in (10.44), we obtain the iteration equation
\[ F_n = g(t_n) + h\sum_{i=0}^{n} w_{ni}K(t_n, t_i, F_i) \quad \text{for } n = 1,2,3,\ldots, \tag{10.56} \]
with $F_0 = f(0) = g(0)$, where $F_n$ denotes the approximate value of $f(t_n)$.

REMARK 10.13 If the $w_{ni}$ are bounded and $h$ is sufficiently small, then $F_n$ for $n = 1,2,3,\ldots$ can be calculated from (10.56). That is, if we do fixed point iteration $F_n^{(k+1)} = G(F_n^{(k)})$ for $k = 0,1,2,\ldots$ with $G : \mathbb{R} \to \mathbb{R}$ and $G$ equal to the right member of (10.56), then the fixed point iteration will converge for $h$ sufficiently small and the $w_{ni}$ bounded. (You will prove this later in Exercise 9.)


We now analyze the error in this numerical method for the special case that the composite trapezoidal rule is used to approximate the integral in (10.44). The following lemma will be useful.

LEMMA 10.1
Suppose that $\epsilon_0 = 0$ and
\[ \epsilon_n \le Bh^2 + Ah\sum_{i=0}^{n-1}\epsilon_i \quad \text{for } n = 1,2,3,\ldots, \tag{10.57} \]
where $A, B > 0$. Then,
\[ \epsilon_n \le Bh^2 e^{Anh} \quad \text{for } n = 0,1,2,\ldots. \tag{10.58} \]

PROOF Inequality (10.58) is clearly true for $n = 0$. Suppose that it is true for $i = 0,1,2,\ldots,n-1$. We will show that it is true for $i = n$ and thus true for all $i$. By (10.57),
\[
\begin{aligned}
\epsilon_n &\le Bh^2 + Ah\sum_{i=0}^{n-1} Bh^2 e^{Ahi} = Bh^2 + ABh^3\,\frac{e^{Anh} - 1}{e^{Ah} - 1} \\
&\le Bh^2 + Bh^2 e^{Anh} - Bh^2 = Bh^2 e^{Anh}.
\end{aligned}
\]
Inequality (10.58) is thus true for all $i$. (Note that $\frac{1}{e^x - 1} \le \frac{1}{x}$ for $x > 0$.)
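A quick numerical check of Lemma 10.1 is sketched below in Python. It is only an illustration: the function name `gronwall_sequence` and the parameter values are assumptions, and the sequence is the extreme case in which (10.57) holds with equality.

```python
import math

def gronwall_sequence(A, B, h, n_max):
    """Return eps_0..eps_{n_max} with eps_0 = 0 and equality in (10.57):
    eps_n = B h^2 + A h * (eps_0 + ... + eps_{n-1})."""
    eps = [0.0]
    running_sum = 0.0  # eps_0 + ... + eps_{n-1}
    for _ in range(1, n_max + 1):
        eps.append(B * h**2 + A * h * running_sum)
        running_sum += eps[-1]
    return eps

# Hypothetical parameter values for the check
A, B, h = 2.0, 3.0, 0.05
eps = gronwall_sequence(A, B, h, 40)
bounds = [B * h**2 * math.exp(A * n * h) for n in range(41)]
```

In this extreme case one finds $\epsilon_n = Bh^2(1 + Ah)^{n-1}$ for $n \ge 1$, which stays below the bound $Bh^2 e^{Anh}$ of (10.58) because $1 + Ah \le e^{Ah}$.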


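THEOREM 10.7
Assume that equation (10.44) has a unique continuous solution $f$ and the kernel satisfies a Lipschitz condition of the form given in Definition 10.3. Let $\bar{K}(t, s) = K(t, s, f(s))$ and assume that $\bar{K} \in C^2([0,T] \times [0,T])$. Then, assuming that $0 < hL < 1$ and the composite trapezoidal rule is used in (10.56),
\[ \epsilon_n \le \frac{M_2 h^2 T}{6}\,e^{2LT} \quad \text{for } n = 0,1,2,\ldots,T/h. \tag{10.59} \]

PROOF Let $\epsilon_n = |F_n - f(t_n)|$ for $n = 0,1,2,\ldots$. Subtracting (10.56) from (10.44) evaluated at $t = t_n$ gives
\[
\begin{aligned}
f(t_n) - F_n &= \int_0^{t_n} K(t_n, s, f(s))\,ds - h\sum_{i=0}^{n} w_{ni}K(t_n, t_i, F_i) \\
&= \left[\int_0^{t_n} \bar{K}(t_n, s)\,ds - h\sum_{i=0}^{n} w_{ni}\bar{K}(t_n, t_i)\right] + h\sum_{i=0}^{n} w_{ni}\left[K(t_n, t_i, f(t_i)) - K(t_n, t_i, F_i)\right].
\end{aligned}
\]
Thus,
\[
\begin{aligned}
\epsilon_n &\le \left|\int_0^{t_n} \bar{K}(t_n, s)\,ds - h\sum_{i=0}^{n} w_{ni}\bar{K}(t_n, t_i)\right| + hL\sum_{i=0}^{n}|w_{ni}|\,\epsilon_i \\
&\le \max_{0 \le s \le t_n}\left|\frac{\partial^2 \bar{K}(t_n, s)}{\partial s^2}\right|\frac{h^2 t_n}{12} + \frac{1}{2}hL\epsilon_n + hL\sum_{i=0}^{n-1}\epsilon_i,
\end{aligned}
\]
where the first term is due to the error in the composite trapezoidal rule. Hence, letting
\[ M_2 = \max_{0 \le s, t \le T}\left|\frac{\partial^2 \bar{K}(t, s)}{\partial s^2}\right| \]
and using the assumption $hL < 1$,
\[ \epsilon_n \le \frac{M_2 h^2 T}{6} + 2hL\sum_{i=0}^{n-1}\epsilon_i. \]
Now, using Lemma 10.1 with $B = \frac{M_2 T}{6}$ and $A = 2L$, we have
\[ \epsilon_i \le \frac{M_2 h^2 T}{6}\,e^{2Lih} \quad \text{for } i = 0,1,2,\ldots, \]
so
\[ \epsilon_n \le \frac{M_2 h^2 T}{6}\,e^{2LT} \quad \text{for any } 0 \le n \le T/h. \]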


REMARK 10.14 Theorem 10.7 says that the order of convergence of this method is $O(h^2)$. Consider the problem
\[ f(t) = e^{t} - \int_0^t e^{t-s}f(s)\,ds, \]
with exact solution $f(t) = 1$. Calculated errors using this method are given below.

Errors $= |F_i - f(t_i)|$
 t  | h = 0.1 | h = 0.05 | h = 0.025
0.2 | $1.7 \times 10^{-4}$ | $4.2 \times 10^{-5}$ | $1.0 \times 10^{-5}$
0.4 | $3.3 \times 10^{-4}$ | $8.3 \times 10^{-5}$ | $2.1 \times 10^{-5}$
0.6 | $5.0 \times 10^{-4}$ | $1.3 \times 10^{-4}$ | $3.1 \times 10^{-5}$
0.8 | $6.7 \times 10^{-4}$ | $1.7 \times 10^{-4}$ | $4.2 \times 10^{-5}$
1.0 | $8.3 \times 10^{-4}$ | $2.1 \times 10^{-4}$ | $5.2 \times 10^{-5}$

The calculated results illustrate that the error is proportional to $h^2$.
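The scheme (10.56) with trapezoidal weights can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the names `volterra_trapezoid` and `error_at_T` are hypothetical, and the implicit unknown $F_n$ is found by the fixed point iteration described in Remark 10.13, with a tolerance and iteration cap chosen arbitrarily. The test problem is the one in Remark 10.14, written as $g(t) = e^t$ and $K(t, s, u) = -e^{t-s}u$.

```python
import math

def volterra_trapezoid(g, K, T, n_steps, tol=1e-14, max_iter=100):
    """Approximate f(t) = g(t) + int_0^t K(t, s, f(s)) ds on [0, T].

    Uses the composite trapezoidal rule; F_n is obtained by fixed point iteration.
    Returns the grid t_0..t_n and the approximations F_0..F_n.
    """
    h = T / n_steps
    t = [i * h for i in range(n_steps + 1)]
    F = [g(t[0])]  # F_0 = g(0)
    for n in range(1, n_steps + 1):
        # Terms of the sum that only involve the known values F_0..F_{n-1}
        known = 0.5 * K(t[n], t[0], F[0])
        for i in range(1, n):
            known += K(t[n], t[i], F[i])
        base = g(t[n]) + h * known
        Fn = F[-1]  # starting guess for the fixed point iteration
        for _ in range(max_iter):
            Fn_new = base + 0.5 * h * K(t[n], t[n], Fn)
            converged = abs(Fn_new - Fn) <= tol
            Fn = Fn_new
            if converged:
                break
        F.append(Fn)
    return t, F

def error_at_T(n_steps):
    """Error |F_n - 1| at t = 1 for the problem of Remark 10.14."""
    _, F = volterra_trapezoid(math.exp, lambda t, s, u: -math.exp(t - s) * u,
                              1.0, n_steps)
    return abs(F[-1] - 1.0)
```

With `n_steps = 10` (that is, $h = 0.1$) the error at $t = 1$ should be close to the tabulated $8.3 \times 10^{-4}$, and halving $h$ should reduce it by about a factor of four.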


REMARK 10.15 More information about numerical methods for Volterra 
integral equations can be found in [53]. 


10.2.2. Fredholm Integral Equations of Second Kind 


In this section, we study linear Fredholm equations of the second kind 


\[ f(t) = g(t) + \int_0^1 K(t, s)f(s)\,ds \quad \text{for } 0 \le t \le 1. \tag{10.60} \]
We begin with an existence and uniqueness result for the solution of (10.60).

THEOREM 10.8
Assume that $g \in C[0,1]$ and $K \in C([0,1] \times [0,1])$. In addition, assume
\[ M = \max_{0 \le t, s \le 1}|K(t, s)| < 1. \]
Then, Equation (10.60) has a unique continuous solution on $[0,1]$.


PROOF We define a sequence of continuous functions on $[0,1]$ by
\[ x_n(t) = g(t) + \int_0^1 K(t, s)x_{n-1}(s)\,ds \quad \text{for } n = 1,2,3,\ldots, \tag{10.61} \]
where $x_0(t) = g(t)$. It can be easily seen that
\[ x_n(t) = g(t) + \int_0^1 K(t, s)g(s)\,ds + \int_0^1 K_2(t, s)g(s)\,ds + \cdots + \int_0^1 K_n(t, s)g(s)\,ds, \tag{10.62} \]
where
\[ K_n(t, s) = \int_0^1 K(t, u)K_{n-1}(u, s)\,du, \quad n \ge 2, \tag{10.63} \]
with $K_1(t, s) = K(t, s)$. It follows from (10.63) that
\[ K_n(t, s) = \int_0^1\!\!\cdots\!\int_0^1 K(t, t_1)K(t_1, t_2)\cdots K(t_{n-1}, s)\,dt_1\,dt_2\cdots dt_{n-1}, \tag{10.64} \]
which leads to
\[ K_n(t, s) = \int_0^1 K_p(t, u)K_{n-p}(u, s)\,du, \quad 1 \le p \le n-1. \tag{10.65} \]
Formula (10.62) can also be written
\[ x_n(t) = g(t) + \int_0^1 \left(\sum_{j=1}^{n} K_j(t, s)\right)g(s)\,ds. \tag{10.66} \]
But the series $\sum_{j=1}^{\infty} K_j(t, s)$ is uniformly convergent with respect to $(t, s) \in [0,1] \times [0,1]$. Indeed, from (10.64),
\[ |K_n(t, s)| \le M^n, \quad n \ge 1, \tag{10.67} \]
which shows that the series $\sum_{j=1}^{\infty} K_j(t, s)$ is dominated by the series with $j$-th term $M^j$, that is, a convergent geometric series. Let's denote
\[ R(t, s) = \sum_{j=1}^{\infty} K_j(t, s). \tag{10.68} \]
It follows that $R(t, s)$ is a continuous function on $[0,1] \times [0,1]$. From (10.62) and (10.67),
\[ |x_n(t) - x_{n-1}(t)| \le NM^n, \tag{10.69} \]
where $N = \sup_{0 \le t \le 1}|g(t)|$. Inequality (10.69) shows that the sequence $\{x_n(t)\}_{n=0}^{\infty}$ is uniformly convergent on $[0,1]$. Taking (10.66) and (10.68) into account, we obtain
\[ x(t) = g(t) + \int_0^1 R(t, s)g(s)\,ds, \tag{10.70} \]
where $x(t)$ is the limit of the sequence $\{x_n(t)\}_{n=0}^{\infty}$. From the construction of this sequence, it follows that $x(t)$ is the solution of (10.60). Consequently, (10.70) gives a solution of (10.60).

To prove uniqueness, assume that $y(t)$ is also a solution of the same equation:
\[ y(t) = g(t) + \int_0^1 K(t, s)y(s)\,ds. \tag{10.71} \]
Using the recurrence formula (10.61) for $x_n(t)$, we obtain
\[
\begin{aligned}
|y(t) - x_n(t)| &\le \int_0^1 |K(t, s)|\,|y(s) - x_{n-1}(s)|\,ds \\
&\le M\int_0^1 |y(s) - x_{n-1}(s)|\,ds.
\end{aligned}
\]
Consequently, denoting $a_n = \sup_{0 \le t \le 1}|y(t) - x_n(t)|$, we obtain
\[ a_n \le Ma_{n-1} \le \cdots \le M^n a_0. \]


But this shows that $a_n \to 0$ as $n \to \infty$, that is, $y(t) = \lim_{n \to \infty} x_n(t) = x(t)$.

10.2.2.2 Numerical Solution Techniques

We now consider numerical solution of (10.60). We assume that we have a numerical integration rule of the form
\[ \int_0^1 \varphi(t)\,dt \approx h\sum_{i=0}^{N} w_i\,\varphi(t_i), \tag{10.72} \]
where $t_i = ih$ and $h = 1/N$, and $\sum_{i=0}^{N} w_i = N$, with $w_i > 0$ for $i = 0,1,2,\ldots,N$.

Approximating the integral in equation (10.60) by the numerical integration scheme (10.72) suggests the following approximate method of solution:
\[ F_i = g(t_i) + h\sum_{j=0}^{N} w_j K(t_i, t_j)F_j \tag{10.73} \]
for $i = 0,1,2,\ldots,N$, where $F_i \approx f(t_i)$ and $t_i = ih = i/N$ and $w_j > 0$ for $j = 0,1,2,\ldots,N$.

We now assume that the quadrature rule has accuracy of order $p$ and that $K$ and $f(t)$ are sufficiently smooth for this order of accuracy to be attained. Specifically, it is assumed that there is a constant $c > 0$ such that
\[ \max_{0 \le i \le N}\left|\int_0^1 K(t_i, s)f(s)\,ds - h\sum_{j=0}^{N} w_j K(t_i, t_j)f(t_j)\right| \le ch^p. \tag{10.74} \]
We now have the following convergence result:

THEOREM 10.9
Assume that the conditions of Theorem 10.8 hold, so equation (10.60) has a unique continuous solution. Also, assume that inequality (10.74) is valid. Then,
\[ \left(\sum_{i=0}^{N} hw_i\left(F_i - f(t_i)\right)^2\right)^{1/2} \le \frac{ch^p}{1 - M}. \tag{10.75} \]

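PROOF Letting $t = t_i$ in (10.60), subtract this equation from (10.73) to obtain:
\[ F_i - f(t_i) = -\int_0^1 K(t_i, s)f(s)\,ds + h\sum_{j=0}^{N} w_j K(t_i, t_j)F_j. \tag{10.76} \]
Letting $\epsilon_i = F_i - f(t_i)$,
\[ \epsilon_i = h\sum_{j=0}^{N} w_j K(t_i, t_j)f(t_j) - \int_0^1 K(t_i, s)f(s)\,ds + h\sum_{j=0}^{N} w_j K(t_i, t_j)\epsilon_j. \tag{10.77} \]
Squaring (10.77), multiplying by $hw_i$, and summing from $i = 0$ to $N$ gives
\[ \sum_{i=0}^{N} hw_i\epsilon_i^2 = \sum_{i=0}^{N} hw_i\left(E_i + \sum_{j=0}^{N} hw_j K(t_i, t_j)\epsilon_j\right)^2, \tag{10.78} \]
where
\[ E_i = \sum_{j=0}^{N} hw_j K(t_i, t_j)f(t_j) - \int_0^1 K(t_i, s)f(s)\,ds, \qquad |E_i| \le ch^p. \]
Hence,
\[ \sum_{i=0}^{N} hw_i\epsilon_i^2 = \sum_{i=0}^{N} hw_iE_i^2 + 2\sum_{i=0}^{N} hw_iE_i\sum_{j=0}^{N} hw_j K(t_i, t_j)\epsilon_j + \sum_{i=0}^{N} hw_i\left(\sum_{j=0}^{N} hw_j K(t_i, t_j)\epsilon_j\right)^2. \tag{10.79} \]
Letting
\[ E = \left(\sum_{i=0}^{N} hw_iE_i^2\right)^{1/2}, \quad \alpha = \left(\sum_{i=0}^{N}\sum_{j=0}^{N} hw_i\,hw_j\,K^2(t_i, t_j)\right)^{1/2}, \quad \epsilon = \left(\sum_{i=0}^{N} hw_i\epsilon_i^2\right)^{1/2}, \]
and applying the Cauchy-Schwarz inequality gives
\[ \epsilon^2 \le E^2 + 2\alpha E\epsilon + \alpha^2\epsilon^2. \tag{10.80} \]
However, this inequality implies that $\epsilon \le \frac{E}{1 - \alpha}$. Noticing that $E \le ch^p$ and $\alpha \le M$ gives the desired result.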
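The method (10.73) can be tried on a small example. The Python sketch below is an illustration, not part of the original text: the names `nystrom_trapezoid` and `weighted_error` are hypothetical, the quadrature is the composite trapezoidal rule (so $p = 2$), and the linear system is solved by the successive approximations used in the proof of Theorem 10.8, which converge here because $M < 1$ and $\sum_j hw_j = 1$. The test problem, $K(t, s) = \frac{1}{2}ts$ and $g(t) = \frac{5}{6}t$ with exact solution $f(t) = t$, is an assumption chosen for illustration.

```python
import math

def nystrom_trapezoid(g, K, N, tol=1e-14, max_iter=500):
    """Solve F_i = g(t_i) + h * sum_j w_j K(t_i, t_j) F_j by successive approximation.

    Uses trapezoidal weights on t_i = i/N; returns the grid and the values F_i.
    """
    h = 1.0 / N
    t = [i * h for i in range(N + 1)]
    w = [0.5] + [1.0] * (N - 1) + [0.5]  # weights sum to N
    F = [g(ti) for ti in t]
    for _ in range(max_iter):
        F_new = [g(t[i]) + h * sum(w[j] * K(t[i], t[j]) * F[j] for j in range(N + 1))
                 for i in range(N + 1)]
        diff = max(abs(a - b) for a, b in zip(F_new, F))
        F = F_new
        if diff <= tol:
            break
    return t, F

def weighted_error(N):
    """Left-hand side of (10.75) for K = t*s/2, g = 5t/6, exact solution f(t) = t."""
    t, F = nystrom_trapezoid(lambda x: 5.0 * x / 6.0, lambda x, s: 0.5 * x * s, N)
    h = 1.0 / N
    w = [0.5] + [1.0] * (N - 1) + [0.5]
    return math.sqrt(sum(h * w[i] * (F[i] - t[i]) ** 2 for i in range(N + 1)))
```

Doubling $N$ from 10 to 20 should reduce `weighted_error` by about a factor of four, consistent with the $h^p$ factor in (10.75) for $p = 2$.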


Galerkin Approach Now consider a Galerkin approach for approximating the solution of the linear Fredholm equation of the second kind (10.60). Let's write (10.60) in the form
\[ Lf = g(t) \quad \text{for } t \in [0,1], \tag{10.81} \]
where
\[ Lf = f(t) - \int_0^1 K(t, s)f(s)\,ds. \]
Let $L^2(0,1)$ be the Hilbert space of Lebesgue measurable functions $\varphi(t)$ such that $\int_0^1 \varphi^2(t)\,dt < \infty$. Let $S_N \subset L^2(0,1)$ be a finite-dimensional subspace of $L^2(0,1)$ with $\{\varphi_i(t)\}_{i=1}^{N}$ a basis for $S_N$. One way of defining a Galerkin method is as follows.

Let $F_N \in S_N$ be an approximation to $f \in L^2(0,1)$, where $F_N(t) = \sum_{i=1}^{N} c_i\varphi_i(t)$ and the $c_i$ are to be determined. Then
\[ LF_N - g = e \in L^2(0,1). \tag{10.82} \]
To find $c_i$, $i = 1,2,\ldots,N$, we make the error $e$ orthogonal to $\varphi_1, \varphi_2, \ldots, \varphi_N$. That is,
\[ (LF_N - g, \varphi_j) = 0 \quad \text{for } j = 1,2,\ldots,N, \tag{10.83} \]
where $(u, v) = \int_0^1 u(x)v(x)\,dx$. Substituting the expansion of $F_N$ into (10.83), we obtain the following linear system for the $c_i$'s:
\[ \sum_{i=1}^{N} c_i(L\varphi_i, \varphi_j) = (g, \varphi_j) \quad \text{for } j = 1,2,\ldots,N. \tag{10.84} \]
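A minimal Python sketch of the Galerkin system (10.84) for a two-dimensional subspace follows. Everything specific in it is an assumption made for illustration: the function names, the basis $\varphi_1(t) = 1$, $\varphi_2(t) = t$, the kernel $K(t, s) = \frac{1}{2}ts$ with $g(t) = \frac{5}{6}t$ (exact solution $f(t) = t$), and the use of a 3-point Gauss rule, which integrates these polynomial integrands exactly.

```python
import math

# 3-point Gauss-Legendre rule mapped to [0, 1]; exact for polynomials of degree <= 5
_G = math.sqrt(0.6)
NODES = [0.5 * (1.0 - _G), 0.5, 0.5 * (1.0 + _G)]
WEIGHTS = [5.0 / 18.0, 8.0 / 18.0, 5.0 / 18.0]

def inner(u, v):
    """Quadrature approximation of (u, v) = int_0^1 u(x) v(x) dx."""
    return sum(w * u(x) * v(x) for x, w in zip(NODES, WEIGHTS))

def apply_L(K, phi):
    """Return the function (L phi)(t) = phi(t) - int_0^1 K(t, s) phi(s) ds."""
    return lambda t: phi(t) - sum(w * K(t, s) * phi(s) for s, w in zip(NODES, WEIGHTS))

def galerkin_fredholm_2(K, g, phi1, phi2):
    """Solve the 2x2 system sum_i c_i (L phi_i, phi_j) = (g, phi_j), j = 1, 2."""
    basis = [phi1, phi2]
    L_basis = [apply_L(K, phi) for phi in basis]
    # A[j][i] = (L phi_i, phi_j), b[j] = (g, phi_j)
    A = [[inner(L_basis[i], basis[j]) for i in range(2)] for j in range(2)]
    b = [inner(g, basis[j]) for j in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    c1 = (b[0] * A[1][1] - A[0][1] * b[1]) / det  # Cramer's rule
    c2 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return c1, c2

c1, c2 = galerkin_fredholm_2(lambda t, s: 0.5 * t * s, lambda t: 5.0 * t / 6.0,
                             lambda t: 1.0, lambda t: t)
```

Because the exact solution $f(t) = t$ lies in the chosen subspace, the computed coefficients should be $c_1 \approx 0$ and $c_2 \approx 1$ up to rounding, which is the behavior the error estimate derived next predicts when $\inf\|f - \Phi_N\| = 0$.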


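To show that $F_N(t)$ is a good approximation to $f(t)$, we use the following lemma. Lemma 10.2 can also be used to show that the coefficient matrix of (10.84) is positive definite and hence nonsingular, assuming that the kernel is symmetric. (See Exercise 11.)

LEMMA 10.2
Suppose that the conditions of Theorem 10.8 hold so that (10.60) has a unique continuous solution. Then
\[ (L\varphi, \varphi) \ge (1 - M)\|\varphi\|^2 \quad \text{for any } \varphi \in L^2(0,1). \tag{10.85} \]

PROOF
\[
\begin{aligned}
(L\varphi, \varphi) &= \int_0^1 \varphi^2(t)\,dt - \int_0^1\!\!\int_0^1 K(t, s)\varphi(s)\varphi(t)\,ds\,dt \\
&\ge \int_0^1 \varphi^2(t)\,dt - \int_0^1 \left(\int_0^1 K^2(t, s)\,ds\right)^{1/2}\left(\int_0^1 \varphi^2(s)\,ds\right)^{1/2}|\varphi(t)|\,dt \\
&\ge \int_0^1 \varphi^2(t)\,dt\left(1 - \left(\int_0^1\!\!\int_0^1 K^2(t, s)\,ds\,dt\right)^{1/2}\right) \\
&\ge (\varphi, \varphi)(1 - M),
\end{aligned}
\]
by applying the Cauchy-Schwarz inequality to obtain the first and second inequalities in the preceding computation.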


By (10.81), the solution $f \in L^2(0,1)$ satisfies
\[ (Lf - g, \varphi) = 0 \quad \text{for all } \varphi \in L^2(0,1). \tag{10.86} \]
Hence, (10.83) and (10.86) give
\[ (Lf - LF_N, \Phi_N) = 0 \quad \text{for all } \Phi_N \in S_N \subset L^2(0,1). \tag{10.87} \]
Combining the above results gives
\[
\begin{aligned}
(1 - M)\|f - F_N\|^2 &\le (L(f - F_N), (f - F_N)) \\
&= (L(f - F_N), (f - \Phi_N)) \\
&\le (L(f - F_N), L(f - F_N))^{1/2}\|f - \Phi_N\| \\
&\le (1 + 2M + M^2)^{1/2}\|f - F_N\|\,\|f - \Phi_N\|,
\end{aligned} \tag{10.88}
\]
using $(L\varphi, L\varphi) \le (1 + 2M + M^2)\|\varphi\|^2$ for $\varphi \in L^2(0,1)$. Hence,
\[ \|f - F_N\| \le \hat{c}\|f - \Phi_N\| \quad \text{for } \Phi_N \in S_N \subset L^2(0,1), \tag{10.89} \]
where $\hat{c} = (1 + 2M + M^2)^{1/2}/(1 - M)$.

REMARK 10.16 Inequality (10.89) says that if the finite-dimensional subspace $S_N$ is chosen so that $\inf_{\Phi_N \in S_N}\|f - \Phi_N\| \to 0$ as $N \to \infty$, then $\lim_{N \to \infty}\|f - F_N\| = 0$.

REMARK 10.17 The Lax-Milgram Lemma can alternately be applied to yield result (10.89).

Example 10.4
\[ f(t) - \int_0^1 K(t, s)f(s)\,ds = g(t) = 1 + t, \quad 0 \le t \le 1, \]


where 


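Let
\[ F_N(t) = \sum_{i=1}^{N} c_i\varphi_i(t), \]
where the $\varphi_i(t)$ are the continuous piecewise linear hat functions. Specifically, let $N = 3$. Solving for $F_3(t)$ yields
\[ F_3(t) = \varphi_1(t) + 1.6522\,\varphi_2(t) + 2.7354\,\varphi_3(t), \]
where
\[
\varphi_1(t) = \frac{\frac{1}{2} - t}{\frac{1}{2}}, \quad 0 \le t \le \frac{1}{2}, \qquad
\varphi_2(t) =
\begin{cases}
\dfrac{t}{\frac{1}{2}}, & 0 \le t \le \frac{1}{2}, \\[2mm]
\dfrac{1 - t}{\frac{1}{2}}, & \frac{1}{2} \le t \le 1,
\end{cases}
\qquad
\varphi_3(t) = \frac{t - \frac{1}{2}}{\frac{1}{2}}, \quad \frac{1}{2} \le t \le 1.
\]
The exact solution of this problem is $f(t) = e^t$. Values for the numerical solution are compared in the following table.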


10.3. Exercises 


1. Consider solving the boundary value problem 


dy _ by 
using the shooting method. Transform the problem into a system of two 
d 
first-order equations by introducing a variable z = =. 
ob 
(a) Assume z(2) = 0 and employ Euler’s method with step size h = 1 


to find y(4). 

(b) Repeat part (a) with z(2) = 3. 

(c) Using the guesses for z(2) and the corresponding results for y(4) 
in parts (a) and (b), interpolate to find the value of z(2) to obtain 
the desired result y(4) = 8. 


2. Consider the two-point boundary-value problem
\[ y''(x) - p(x)y'(x) - q(x)y(x) = r(x), \quad 0 < x < 1, \]
with $y(0) = y(1) = 0$. Assume that $q(x) \ge \alpha > 0$ for $0 \le x \le 1$. Consider the difference scheme
\[ \frac{y_{j+1} - 2y_j + y_{j-1}}{h^2} - p(x_j)\frac{y_{j+1} - y_{j-1}}{2h} - q(x_j)y_j = r(x_j) \]
for $j = 1,2,\ldots,N-1$, with $y_0 = y_N = 0$, $x_j = jh$ and $h = 1/N$.

(a) Determine the matrix $A$ so that the above difference equations can be written as the linear system
\[ Ay = h^2 r \]
with $y = [y_1, y_2, \ldots, y_{N-1}]^T$ and $r = [r(x_1), r(x_2), \ldots, r(x_{N-1})]^T$.
(b) Prove that if
\[ \frac{h}{2}\max_{0 \le x \le 1}|p(x)| < 1, \]
then the $(N-1) \times (N-1)$ matrix $A$ is strictly diagonally dominant.
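3. Consider
\[ y'' - p(x)y' - q(x)y = r(x), \quad a \le x \le b, \qquad y(a) = \alpha, \quad y(b) = \beta, \]
with $q(x) \ge c_2 > 0$. Consider the difference equations
\[ \left(\frac{u_{j+1} - 2u_j + u_{j-1}}{h^2}\right) - p(x_j)\left(\frac{u_{j+1} - u_j}{h}\right) - q(x_j)u_j = r(x_j) \]
for $j = 1,\ldots,N$, with $u_0 = \alpha$, $u_{N+1} = \beta$, where
\[ x_i = a + ih, \qquad h = \frac{b - a}{N + 1}. \]
Assume that $y \in C^4[a,b]$. Prove that for $h$ sufficiently small,
\[ \max_i |u_i - y(x_i)| \le h\left(\frac{hM_4 + 6M_2P^*}{12c_2}\right), \]
where $M_k = \max_{a \le x \le b}|y^{(k)}(x)|$.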


4. Show that the approximations given in (10.21) (on page 543) are accurate to second order when $\alpha = \beta = 0$, but are only first order otherwise.


5. Consider the boundary-value problem
\[ -\frac{d^2y}{dx^2} = f(x), \qquad y(0) = y(1) = 0, \]
with $f \in C[0,1]$, so that $y \in C^2[0,1]$ uniquely exists. Consider approximating $y(x)$ by
\[ Y(x) = \sum_{i=1}^{M-1} a_i\varphi_i(x) \in S_M, \]
where $S_M$ is the set of continuous piecewise linear functions on a partition $x_0 = 0$, $x_1 = h$, $x_2 = 2h$, ..., $x_M = 1$, and $h = \frac{1}{M}$. If the approximation is by the Galerkin method, then prove that
\[ \|Y - y\|_F^2 = \frac{1}{2}\int_0^1 (Y'(x) - y'(x))^2\,dx \le ch^2\max_{0 \le x \le 1}|y''(x)|^2. \]


6. Apply the Galerkin method to approximate the solution $y(x)$ to the two-point boundary-value problem


Let
\[ y(x) \approx Y(x) = c_1\sin(\pi x) + c_2\sin(2\pi x) \]
and find $c_1$ and $c_2$. Also find the exact solution $y(x)$ and compare $Y(x)$ to $y(x)$ at $x = 0.25$, $0.5$, and $0.75$.


7. Show that


f(@®)=14 i won 


has a unique continuous solution for all t. 


8. Prove Remark 10.12 on page 556.


9. Prove the assertion in Remark 10.13 on page 557. (Hint: Use the contraction mapping theorem.)


10. Consider the Fredholm integral equation


\[ f(t) = g(t) + \int_0^1 K(t,s)\,f(s)\,ds \]


for $0 \le t \le 1$, where $\max_{0 \le t,s \le 1} |K(t,s)| = \tfrac{1}{2}$. Suppose that


\[ \Bigl| \int_0^1 K(t_i,s)\,f(s)\,ds - h \sum_{j=0}^{N} w_j K(t_i,t_j)\,f(t_j) \Bigr| \le c h^2 \]


for each $i$, where $t_i = ih$. Also suppose that $h < \tfrac{1}{2}$, $0 < w_j < 1$ for all $j$, and $\sum_{j=0}^{N} h w_j = 1$. Let


\[ F_i = g(t_i) + h \sum_{j=0}^{N} w_j K(t_i,t_j)\,F_j \]


for $i = 0, 1, \ldots, N$. Prove that $\max_j |f(t_j) - F_j| \le 2ch^2$.


11. Use Lemma 10.2 (on page 564) to show that system (10.84) (on page 564) has a positive definite coefficient matrix and hence is nonsingular. Assume that $K(t,s) = K(s,t)$, i.e., the kernel $K$ is symmetric.


Appendix A 


Solutions to Selected Exercises 


Solutions from Chapter 1 
1(c). Since $f \in C[a,b]$, we have


\[ \min_{x \in [a,b]} f(x) \le f(x_j) \le \max_{x \in [a,b]} f(x) \]
for $j = 1, \ldots, n$. Also, multiplying the inequality by $w_j$ and summing over $j$ from $1, \ldots, n$, we get


\[ \min_{x \in [a,b]} f(x) \le \sum_{j=1}^{n} w_j f(x_j) \le \max_{x \in [a,b]} f(x), \]


since $w_j \ge 0$ and $\sum_{j=1}^{n} w_j = 1$. The result then follows from the Intermediate Value Theorem.


5. See Example 1.3. 


7. It is easy to verify that a + (b+ c) = 0.327, which is not equal to 
(a+ b) +c= 0.326. 
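
The failure of associativity noted in solution 7 is easy to reproduce in three-digit decimal arithmetic. The sketch below uses Python's `decimal` module; the operands `a`, `b`, `c` are hypothetical values chosen so that the two results match the ones quoted, and round-to-nearest is an assumption (the exercise's own data and rounding mode are not shown in this excerpt).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Three significant decimal digits, round to nearest (an assumed rounding mode).
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

# Hypothetical operands; the exercise's actual data are not reproduced here.
a, b, c = Decimal("0.325"), Decimal("0.0004"), Decimal("0.0012")

left = ctx.add(a, ctx.add(b, c))    # a + (b + c), rounded after each operation
right = ctx.add(ctx.add(a, b), c)   # (a + b) + c, rounded after each operation
print(left, right)
```

Adding the two small numbers first preserves digits that are otherwise rounded away one at a time, which is why the two orders disagree in the last place.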


9. Let Sp =a’ b= x1 23 + x24. Note that 
viv; =@4(1+e) r3(1+€3)(1 + 413), 


where €13 < 6 corresponds to the error due to the multiplication. Simi- 
larly, 
L504 = L2(1+ 2) v4(1 + €4)(1 + € 24), 


where €24 < 6 corresponds to the error due to the multiplication. We 
then have 
bre Soy < She”. 


16. $|x^* - y^*| = |x(1 + r_x) - y(1 + r_y)| \ge |x - y| - (|x| + |y|)R$. Therefore, to ensure $x^* \ne y^*$ we must have $R < |x - y|/(|x| + |y|)$.


19.
\[ \kappa_{fg}(x) = \left| \frac{x\,(fg)'(x)}{f(x)\,g(x)} \right| = \left| \frac{x f'(x) g(x) + x g'(x) f(x)}{f(x)\,g(x)} \right| \le \kappa_f(x) + \kappa_g(x). \]


21. From Theorem 1.8 and using | < $10-**1, we have 


1 1 
< {0 =210-e, 
= 2 


Solutions from Chapter 2 


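5. (a) We use induction. Clearly, $x^* \in [x_0, y_0]$, $f(x_0) < 0$, and $f(y_0) > 0$. Assume that $x^* \in [x_{k-1}, y_{k-1}]$ and $f(x_{k-1}) < 0$, $f(y_{k-1}) > 0$. If $f\bigl(\tfrac{x_{k-1} + y_{k-1}}{2}\bigr) < 0$, then $x_k = \tfrac{x_{k-1} + y_{k-1}}{2}$ and $y_k = y_{k-1}$. Thus, $f(x_k) < 0$ and $f(y_k) > 0$ and $x^* \in [x_k, y_k]$. If $f\bigl(\tfrac{x_{k-1} + y_{k-1}}{2}\bigr) > 0$, then $x_k = x_{k-1}$ and $y_k = \tfrac{x_{k-1} + y_{k-1}}{2}$. Thus $f(x_k) < 0$ and $f(y_k) > 0$ and $x^* \in [x_k, y_k]$. The proof thus follows by induction.


(b) Note that


\[ y_k - x_k = \frac{1}{2}(y_{k-1} - x_{k-1}) = \frac{1}{2^k}(y_0 - x_0) = \frac{b - a}{2^k}. \]


Hence, $|x^* - x_k| \le |x_k - y_k| \le \frac{b - a}{2^k}$, since $x^* \in [x_k, y_k]$.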
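
The bracketing invariant and the bound $(b - a)/2^k$ can be observed directly. The following is a minimal sketch of bisection that keeps the sign pattern used in the induction; the test function $t^3 - 2$ on $[1, 2]$ is a hypothetical example, not the one from the exercise.

```python
def bisect(f, a, b, steps):
    """Bisection keeping f(x_k) < 0 <= f(y_k); returns every bracket [x_k, y_k]."""
    x, y = a, b
    brackets = [(x, y)]
    for _ in range(steps):
        mid = 0.5 * (x + y)
        if f(mid) < 0:
            x = mid          # the root lies in the right half
        else:
            y = mid          # the root lies in the left half
        brackets.append((x, y))
    return brackets


# Hypothetical example: f(t) = t^3 - 2 on [1, 2], with root 2^(1/3).
brackets = bisect(lambda t: t**3 - 2.0, 1.0, 2.0, 30)
root = 2.0 ** (1.0 / 3.0)
for k, (xk, yk) in enumerate(brackets[:5]):
    print(k, xk, yk, abs(root - xk) <= 1.0 / 2**k)
```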


12. Let $g(x) = b + \epsilon h^2(x)$. Then $x_{k+1} = b + \epsilon h^2(x_k)$ is the same as the fixed point iteration $x_{k+1} = g(x_k)$. Clearly $g : \mathbb{R} \to \mathbb{R}$. Now consider, for $x, y \in \mathbb{R}$,


\[ |g(x) - g(y)| = |\epsilon|\,|h^2(x) - h^2(y)| = |\epsilon|\,|h(x) + h(y)|\,|h(x) - h(y)| \le 2ML|\epsilon|\,|x - y| < 1 \cdot |x - y|. \]


Thus $g$ is a contraction, so there exists a unique $z$ such that $z = g(z)$ and $\{x_k\}_{k=0}^{\infty}$ converges to $z$ for any $x_0 \in \mathbb{R}$, by the Contraction Mapping Theorem.
14. We have 
1 fa? 2 Oo 
=5|7-2°--2+4 
g(x) 5 E x get | ; 
1 5 1 
fe) = |e? 20- 3], o"(@) = Flee 2) 


First, note that g(a) has maximum and minimum values on [0,2] at 
x = 0, « = 2, or where g/(x) = 0. (However, g’(x) = 0 at x = 3 and 
.) Thus, 4 = 9(2) < g(x) < g(0) = $ for any « € [0,2]. Thus, 
| : — [0,2]. One may also note that 


Thus, g is a contraction on [0,2]. The result now follows from the 
Contraction Mapping Theorem. 


23. Since $f \in C^1$ and $x_k \to r$ we have $f(x_k) \to f(r)$ and $f'(x_k) \to f'(r)$.
Hence, taking the limits, we get


a tavgy ET as Glee IND a 
uy Pies ie. PRat ae Sid (Lp 1) ras a f(r) =a PC ) - us 
Also, using Taylor expansion, we have $f(x_k) = f(r) + f'(\xi)(x_k - r)$,


where $\xi$ is between $x_k$ and $r$. Then, $f(r) = 0$ gives


\[ |x_k - r| \le \max_{x \in [a,b]} \frac{|f(x_k)|}{|f'(x)|}. \]
25. Consider the function $f(x) = R - 1/x$. Then, $f'(x) = 1/x^2$, and Newton's method is


\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = x_k - \frac{R - 1/x_k}{1/x_k^2}, \]


which can be simplified to yield $x_{k+1} = x_k(2 - R x_k)$.
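
The simplified iteration computes a reciprocal using only multiplication and subtraction. A minimal sketch follows; the function name `reciprocal`, the value $R = 7$, and the starting guess are illustrative choices. The relative error $1 - R x_k$ is squared at each step, so the guess must satisfy $0 < x_0 < 2/R$ for convergence.

```python
def reciprocal(R, x0, iterations=6):
    """Approximate 1/R with x_{k+1} = x_k * (2 - R * x_k); no division is used."""
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - R * x)
    return x


# Illustrative values: R = 7 with x0 = 0.1, so 1 - R*x0 = 0.3.
print(reciprocal(7.0, 0.1), 1.0 / 7.0)
```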


26. Let $f(x) = \int_0^x e^{t^2}\,dt - 1$. We have a single nonlinear equation to solve to obtain $x^*$ such that $f(x^*) = 0$. There is a unique zero because $f(0) < 0$, $f(1) > 0$, and $f'(x) > 0$ for $x \in \mathbb{R}$. One way to numerically solve the nonlinear equation is to use Newton's method, which for this equation is


\[ x_{k+1} = x_k - \frac{\int_0^{x_k} e^{t^2}\,dt - 1}{e^{x_k^2}}. \]


The integral can be computed numerically using some numerical integration method described in Chapter 6.
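
One way to carry out this computation is sketched below, assuming the integrand is $e^{t^2}$ with lower limit 0 (the reading consistent with $f(0) < 0 < f(1)$). Composite Simpson's rule stands in for "some numerical integration method"; it is one possible choice, and the helper names are illustrative.

```python
import math


def simpson(g, a, b, n=200):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0


def f(x):
    """f(x) = integral of exp(t^2) from 0 to x, minus 1."""
    return simpson(lambda t: math.exp(t * t), 0.0, x) - 1.0


def newton(x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / math.exp(x * x)    # f'(x) = exp(x^2)
        x -= step
        if abs(step) < tol:
            break
    return x


x_star = newton(1.0)
print(x_star, f(x_star))
```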


Solutions from Chapter 3 


1. Let $A$ be an invertible matrix. For $x \ne 0$, let $y = A^{-1}x \ne 0$. Since $A$ is positive definite, we have $(y^T A y)^T > 0 \Rightarrow [(A^{-1}x)^T A (A^{-1}x)]^T > 0 \Rightarrow [x^T A^{-T} A A^{-1} x]^T = (x^T A^{-T} x)^T = x^T A^{-1} x > 0$. Hence, $A^{-1}$ is also positive definite.
2. Let $A$ be a symmetric positive definite $n \times n$ matrix. For a given index $j$, consider the nonzero vector $x = (x_i)$ defined by $x_j = 1$ and $x_i = 0$ if $i \ne j$. Then we have $x^T A x > 0 \Rightarrow a_{jj} > 0$ for each $j = 1, 2, \ldots, n$.


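7. (a) Let $F = -A^{-1}B$; then $A + B = A(I - F)$. Suppose $I - F$ is singular; then for some $x \ne 0$, $(I - F)x = 0 \Rightarrow \|Fx\| = \|x\| \Rightarrow \|x\| \le \|x\|\,\|F\| \Rightarrow \|F\| \ge 1$, i.e. $\|A^{-1}B\| \ge 1$, a contradiction. Hence $I - F = I + A^{-1}B$ is nonsingular and since $A$ is also nonsingular $\Rightarrow A + B = A(I - F)$ is nonsingular. Moreover,


\[ \|(A + B)^{-1}\| = \|(A(I - F))^{-1}\| = \|(I - F)^{-1} A^{-1}\| \le \|(I - F)^{-1}\|\,\|A^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|F\|} \le \frac{\|A^{-1}\|}{1 - r}. \]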


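(b)

\[
\begin{aligned}
\|(A + B)^{-1} - A^{-1}\| &= \|(A + B)^{-1} - (A + B)^{-1}(A + B)A^{-1}\| \\
&= \|(A + B)^{-1}(I - AA^{-1} - BA^{-1})\| \\
&= \|(A + B)^{-1}(-BA^{-1})\| \\
&\le \|(A + B)^{-1}\|\,\|B\|\,\|A^{-1}\| \\
&\le \frac{\|B\|\,\|A^{-1}\|^2}{1 - r}, \quad \text{using (a)}.
\end{aligned}
\]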


12. (a) B= A-C = A(I—A™!C). Also, ||A71Cll2 < ||A7+]l2llCll2 = 
10? - 10+ = = < 1. Hence, p(A7'C) < 1, and (I — A7!C) is 
nonsingular. Hence B is nonsingular. 


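(b) $x = A^{-1}b$ and $z = B^{-1}b$. Therefore,


\[
\begin{aligned}
\|x - z\|_2 &= \|A^{-1}b - B^{-1}b\|_2 = \|A^{-1}b - (I - A^{-1}C)^{-1}A^{-1}b\|_2 \\
&\le \|I - (I - A^{-1}C)^{-1}\|_2\,\|A^{-1}b\|_2 \\
&\le \|(I - A^{-1}C)^{-1}\|_2\,\|A^{-1}C\|_2\,\|x\|_2 \\
&\le \frac{\|A^{-1}C\|_2}{1 - \|A^{-1}C\|_2}\,\|x\|_2.
\end{aligned}
\]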


Using part (a), we find that |x — z||2 < $llall2. Hence, c= ¢. 


13. (a) Let A = 5(J — H), where 


0 ifi=J, 
H=¢-24 ift=j+lori=j-l, 
O otherwise. 


Note that ||H loo = 2, so p(H) < ||H|loo < 1. Therefore, (I — H) 
is nonsingular, so A is nonsingular. 
(b) Since H satisfies 


= < IE — 2) leo < 


T+ [Allo leo" 
part (a) gives 
a : : 1 7 
Ao = 3 — H)* => ||AM loo = sll — HY leo: 


Hence, + < ||A7"lloo < §-. 
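14. (a) We have
\[
\begin{aligned}
\|I - AB_k\| &= \|I - AB_{k-1} - AB_{k-1}(I - AB_{k-1})\| \\
&= \|(I - AB_{k-1})^2\| \\
&\le \|I - AB_{k-1}\|^2 \le \|I - AB_{k-2}\|^4 \\
&\le \cdots \le \|I - AB_0\|^{2^k}.
\end{aligned}
\]


(b) Since $\|I - AB_0\| = c < 1$, we have
\[ \|(I - (I - AB_0))^{-1}\| \le \frac{1}{1 - \|I - AB_0\|} \;\Rightarrow\; \|B_0^{-1}A^{-1}\| \le \frac{1}{1 - c}. \]


Then
\[ \|A^{-1}\| = \|B_0 B_0^{-1} A^{-1}\| \le \frac{\|B_0\|}{1 - c}. \]
Finally, using part (a),
\[ \|A^{-1} - B_k\| = \|A^{-1}(I - AB_k)\| \le \|A^{-1}\|\,\|I - AB_k\|, \]
and the result follows.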
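
Part (a) says the residual $I - AB_k$ is squared at every step, which corresponds to the update $B_k = B_{k-1} + B_{k-1}(I - AB_{k-1})$. The sketch below tracks the residual norm for a small example; the $2 \times 2$ matrix `A`, the starting guess `B0`, and the helper names are hypothetical, and the infinity norm is used only because it is easy to compute.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def residual(A, B):
    """Return I - A B."""
    AB = matmul(A, B)
    n = len(A)
    return [[(1.0 if i == j else 0.0) - AB[i][j] for j in range(n)] for i in range(n)]


def inf_norm(X):
    return max(sum(abs(v) for v in row) for row in X)


def refine_inverse(A, B0, steps):
    """Iterate B <- B + B (I - A B) and record ||I - A B|| after each step."""
    B = [row[:] for row in B0]
    norms = [inf_norm(residual(A, B))]
    for _ in range(steps):
        BR = matmul(B, residual(A, B))
        B = [[B[i][j] + BR[i][j] for j in range(len(B))] for i in range(len(B))]
        norms.append(inf_norm(residual(A, B)))
    return B, norms


A = [[4.0, 1.0], [2.0, 3.0]]          # hypothetical test matrix
B0 = [[0.25, 0.0], [0.0, 0.3]]        # crude guess with ||I - A B0|| = 0.6
B, norms = refine_inverse(A, B0, 6)
print(norms)
```

Each recorded norm should be no larger than $c^{2^k}$ with $c = \|I - AB_0\|$, up to rounding, and `B` approaches $A^{-1}$.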


15. Since $\lambda$ is an eigenvalue of $A$, by definition there exists an eigenvector $x$ such that $Ax = \lambda x$. Thus,


\[ Ix - cAx = x - c\lambda x \;\Rightarrow\; (I - cA)x = (1 - c\lambda)x. \]


Therefore, $1 - c\lambda$ is an eigenvalue of $(I - cA)$ with corresponding eigenvector $x$.


16. Since $\rho(A) < 1$, then by problem 15, none of the eigenvalues of $I - A$ can equal zero. Thus, $(I - A)$ is nonsingular. By directly multiplying, we have $(I - A)(I + A + A^2 + \cdots + A^N) = I - A^{N+1}$. Hence,


\[ \sum_{i=0}^{N} A^i = (I + A + A^2 + \cdots + A^N) = (I - A)^{-1}(I - A^{N+1}) \]


because $I - A$ is nonsingular.
Consider


\[ y^T A y = \sum_{i=1}^{n} \sum_{j=1}^{n} y_i a_{ij} y_j = \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \int_0^1 e^{ix} e^{jx}\,dx = \int_0^1 \Bigl( \sum_{i=1}^{n} y_i e^{ix} \Bigr) \Bigl( \sum_{j=1}^{n} y_j e^{jx} \Bigr) dx. \]


Letting $f(x) = \sum_{j=1}^{n} y_j e^{jx}$, we then have $\int_0^1 (f(x))^2\,dx \ge 0$. Let $z = e^x$. Then $\sum_{j=1}^{n} y_j e^{jx} = \sum_{j=1}^{n} y_j z^j$ is a polynomial of degree $n$ in $z$. But $\sum_{j=1}^{n} y_j z^j \equiv 0$ only if $y_j = 0$ for $j = 1, \ldots, n$. Thus $y^T A y > 0$ if $y \ne 0$. Hence, $A$ is positive definite and therefore has a Cholesky factorization.


$A = I - xx^T \Rightarrow A^T = A$ and
\[ A^T A = (I - xx^T)^T (I - xx^T) = I - 2xx^T + x x^T x x^T. \]


Note that $x^T x = \|x\|_2^2 = 2$. We then have $A^T A = I \Rightarrow A^{-1} = A^T$. The condition number then is given by


\[ \kappa_2(A) = \|A^{-1}\|_2 \|A\|_2 = \|A^T\|_2 \|A\|_2 = \|A\|_2^2 = \rho(A^T A) = \rho(I) = 1. \]


A is strictly diagonally dominant. Hence, the Gauss-Seidel and Jacobi 
methods converge. 


Suppose $\|M^{-1}N\| < 1$; then $I - M^{-1}N$ is nonsingular. Since $M$ is also nonsingular, $M(I - M^{-1}N)$ is nonsingular. This implies that


\[ A = M - N = M(I - M^{-1}N) \]
is nonsingular, which is a contradiction. Hence, $\|M^{-1}N\| \ge 1$.


Since A is positive definite, 


ded) Sa eG Sy 


V7e 


The Jacobi matrix corresponding to the given matrix A is 


a 


The eigenvalues of J are +3/,/ya < 1. Hence, p(J) < 1, which implies 
that the Jacobi method converges. 
52. Consider 


T =-(L+ D)"U =- & 4 a € i). 


Thus p(T1) = 3 > 1, so the method is not convergent for ¢ = 1. 
Consider 


1 Pra als » A SON ek 1 
TA(2EaD a 5 ae a a fe 
ee) GPa aa) Noa) 
Furthermore, p(T\/2) = 1/2. Hence, the method is convergent for 0 = 
1/2. 


Seo ater 
nN 


Solutions from Chapter 4 
3. Since g* € P” is the least squares approximation to f, we have 


n 


9* (x) = > (wis f)y;(2). 


i=0 


Now let p(x “dy p(x). Then 


0 
= Sl ai(f,¢) — OV ap: 
j=0 i=0 
= 0. 
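4. The polynomials are
\[ q_0(x) = \varphi_0, \qquad q_1(x) = \varphi_1 - (\varphi_1, q_0)\,\frac{q_0}{\|q_0\|^2}, \qquad q_2(x) = \varphi_2 - (\varphi_2, q_0)\,\frac{q_0}{\|q_0\|^2} - (\varphi_2, q_1)\,\frac{q_1}{\|q_1\|^2}. \]
Using $\{\varphi_0, \varphi_1, \varphi_2\} = \{1, x, x^2\}$ and $(f, g) = \int_0^1 f(x)\,g(x)\,dx$, we get
\[ q_0(x) = 1, \qquad q_1(x) = x - \tfrac{1}{2}, \qquad q_2(x) = x^2 - x + \tfrac{1}{6}. \]


The least squares approximation of $f(x) = x^{1/2}$ on $[0,1]$ then is


\[ \sum_{i=0}^{2} c_i q_i, \qquad \text{where } c_i = \frac{(f, q_i)}{(q_i, q_i)}. \]


Using this, we get $c_0 = \tfrac{2}{3}$, $c_1 = \tfrac{4}{5}$, $c_2 = -\tfrac{4}{7}$. Hence,


\[ F(x) = \tfrac{2}{3}\,q_0(x) + \tfrac{4}{5}\,q_1(x) - \tfrac{4}{7}\,q_2(x) = \tfrac{1}{35}\,(6 + 48x - 20x^2). \]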
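
The coefficients in solution 4 can be reproduced in exact rational arithmetic. The sketch below represents a function as a dictionary mapping exponent to coefficient and uses $\int_0^1 x^e\,dx = 1/(e+1)$. It assumes $f(x) = x^{1/2}$, which is the choice that yields the quoted result $\tfrac{1}{35}(6 + 48x - 20x^2)$; the helper names are illustrative.

```python
from fractions import Fraction as Fr


def inner(p, q):
    """Integral over [0,1] of p*q, where p and q map exponent -> coefficient."""
    return sum(a * b / (e1 + e2 + 1) for e1, a in p.items() for e2, b in q.items())


def axpy(alpha, p, q):
    """Return q + alpha * p."""
    out = dict(q)
    for e, a in p.items():
        out[e] = out.get(e, Fr(0)) + alpha * a
    return out


def gram_schmidt(basis):
    """Orthogonalize with q_k = phi_k - sum_j (phi_k, q_j) q_j / (q_j, q_j)."""
    qs = []
    for phi in basis:
        q = dict(phi)
        for prev in qs:
            q = axpy(-inner(phi, prev) / inner(prev, prev), prev, q)
        qs.append(q)
    return qs


q = gram_schmidt([{Fr(k): Fr(1)} for k in range(3)])   # from 1, x, x^2
f = {Fr(1, 2): Fr(1)}                                   # assumed f(x) = x^(1/2)
c = [inner(f, qi) / inner(qi, qi) for qi in q]

approx = {}
for ci, qi in zip(c, q):
    approx = axpy(ci, qi, approx)
print(c, approx)
```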


6. (a) Follows from Theorems 4.5 and 4.6. 


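(b) One can easily find that $P_1(x) = \tfrac{3}{2} - \tfrac{x}{2}$. Moreover, $f''(\xi) = \tfrac{2}{\xi^3}$. Thus,


\[ f(x) - P_1(x) = \frac{1}{x} - \frac{3 - x}{2} = \frac{1}{2}\,(x - 1)(x - 2)\,\frac{2}{\xi^3}. \]


Thus, solving this equation for $\xi$ we get $\xi(x) = (2x)^{1/3}$. Since this is an increasing function we find that


\[ 1.2599 \approx 2^{1/3} \le \xi(x) \le 4^{1/3} \approx 1.5874, \qquad 1 \le x \le 2. \]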


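8. Let $A = (a_{ij}) = (w_i, w_j)$. Note that $a_{ij} = a_{ji}$. For $x \ne 0$, we have


\[ x^T A x = \sum_{j=1}^{n} x_j \sum_{i=1}^{n} a_{ji} x_i = \sum_{j=1}^{n} x_j \sum_{i=1}^{n} (w_i, w_j)\,x_i = \Bigl( \sum_{i=1}^{n} w_i x_i, \; \sum_{j=1}^{n} w_j x_j \Bigr) = z^T z > 0, \]


where $z = \sum_{i=1}^{n} w_i x_i$.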


11. Given $\epsilon > 0$, the Weierstrass approximation theorem states that there exists a polynomial $p$ such that $\|f - p\|_\infty < \epsilon/2$. Let $N$ be the degree of this polynomial $p$. Now assume $n > N$ and consider


|En(f)| = |En(f) — En()| 


, F@) aye — Yo Li) = p(«i)) 


fis p(a)|de +S bflas) ~ plas] 


< 2\|f — Plleo Se 


I 


14. 


15. 


26. 


27. 
when n > N. 
Using the transformation $x = 2(t + 1)$, we get
\[
\begin{aligned}
P_3(t) &= 8t^3 - 12t^2 - 88t - 63 \\
&= -63T_0 - 88T_1 - 6(T_2 + T_0) + 2(T_3 + 3T_1) \\
&= -69T_0 - 82T_1 - 6T_2 + 2T_3,
\end{aligned}
\]


where $-1 \le t \le 1$ and $T_i(t)$ is the corresponding Chebyshev polynomial. If we truncate the polynomial at $T_2$, the approximation is


\[ P_2(t) = -69T_0 - 82T_1 - 6T_2 = -63 - 82t - 12t^2, \]


which has maximum absolute error $\delta = 2$. Substituting $t = (x - 2)/2$, we get $P_2(x) = -3x^2 - 29x + 7$.
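
The economization step can be checked numerically: the dropped term is $2T_3(t) = 8t^3 - 6t$, whose magnitude on $[-1, 1]$ never exceeds 2. The sketch below samples the difference on a uniform grid (the grid size is an arbitrary choice) and also confirms that the $x$-form agrees with the $t$-form under $x = 2(t + 1)$.

```python
def p3(t):
    return 8 * t**3 - 12 * t**2 - 88 * t - 63


def p2(t):
    return -63 - 82 * t - 12 * t**2


def p2_in_x(x):
    return -3 * x**2 - 29 * x + 7


# Sample the truncation error 2*T_3(t) on a uniform grid over [-1, 1].
grid = [-1 + 2 * k / 1000 for k in range(1001)]
max_err = max(abs(p3(t) - p2(t)) for t in grid)

# The two forms of the quadratic should agree when x = 2(t + 1).
mismatch = max(abs(p2_in_x(2 * (t + 1)) - p2(t)) for t in grid)
print(max_err, mismatch)
```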


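(a) Note that $l_i(x_l) = \delta_{il}$ and $l_j(y_m) = \delta_{jm}$. Thus


\[ p(x_l, y_m) = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} V_{ij}(x_l, y_m) = c_{lm}. \]


Therefore, $p(x,y)$ interpolates $f(x,y)$ at the $n^2$ points if $c_{lm} = f(x_l, y_m)$ for $m, l = 1, 2, \ldots, n$.

(b) Note that $\sum_{i=1}^{n} l_i(x)$ interpolates 1 at $x_1, x_2, \ldots, x_n$. Define the $(n-1)$-degree polynomial $p(x)$ by $p(x) = \sum_{i=1}^{n} l_i(x) - 1$. Since $p(x_i) = 0$ for $i = 1, 2, \ldots, n$, we have $p(x) \equiv 0$. Thus $\sum_{i=1}^{n} l_i(x) = 1$. But


\[ \sum_{i=1}^{n} \sum_{j=1}^{n} V_{ij}(x,y) = \sum_{i=1}^{n} l_i(x) \sum_{j=1}^{n} l_j(y) = 1. \]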


The piecewise polynomials $s_1(x)$ and $s_2(x)$ must satisfy $s_2(0) = s_1(0) = 1 + c$, $s_2'(0) = s_1'(0) = 3c$, $s_2''(0) = s_1''(0) = 6c$. From these, we obtain $s_2(x) = 1 + c + 3cx + 3cx^2 + Ax^3$. Since $s(x)$ is a natural cubic spline, we also have $s_2''(1) = 0 \Rightarrow A = -c$. Moreover, we also require $s(1) = -1 \Rightarrow s_2(1) = -1$. This then yields $1 + c + 3c + 3c - c = -1 \Rightarrow c = -\tfrac{1}{3}$.
We have 


O0<a<1 O<i<N-1 | aj <e<ai41 


ibs ba) =P a { ee ie) -s@]} 


< max { max Lie; — a} 


O<i<N-1 | ai<e<ai4i 
< Dh. 
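35. Starting with $e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}$, consider


\[ e^x\,(1 + b_1 x + b_2 x^2) - (a_0 + a_1 x) = \sum_{j=0}^{\infty} d_j x^j. \]


Setting $d_j = 0$ for $j = 0, 1, 2, 3$, we get $1 - a_0 = 0$, $1 + b_1 - a_1 = 0$, $\tfrac{1}{2} + b_1 + b_2 = 0$, $\tfrac{1}{6} + \tfrac{b_1}{2} + b_2 = 0$. Solving these equations we get $a_0 = 1$, $a_1 = \tfrac{1}{3}$, $b_1 = -\tfrac{2}{3}$, $b_2 = \tfrac{1}{6}$, and


\[ e^x \approx \frac{1 + \tfrac{1}{3}x}{1 - \tfrac{2}{3}x + \tfrac{1}{6}x^2}. \]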

Solutions from Chapter 5 


7 


3. Using Gerschgorin’s Circle Theorem, we find that |A| < max{3, 5, $}. 


Hence, p(A) < 1. 


. Let q® = 37, Gai, where {2;}", are the orthonormal eigenvectors 


of A, where x; through 2, m1 > 1 correspond to A; = 1, and where 
Lm,4+1 tO Lm,4+mz, M2 > 1 correspond to Ag = —1. Then, 


gQ? = (>: arn) 
— mitme n 
= Sexi + (-1)" ( S- on) + bs iN; Li, 
i=1 


i=m,4+1 i=m1+mM24+1 


where the last sum is interpreted to be zero if m, + mz = n. Then, 
given € > 0, there is an N such that 


n 


i=m1+mMm2+1 


zs 
2 


for r > N, since |A;| < 1 for 7 > mi+mz2. Thus 


mitme n 
gift) z q’” = 2(—1)""? ( x on) + S- (Apt? — NN )axi, 


i=mi+m2+1 


which then gives 


llg°t? —g || 22 
for $r > N$, where $\epsilon$ can be made arbitrarily small. Hence, the method 
fails to converge. Indeed, for r large, 


m4 mi+m2 
q’) (s: on) ih)" ( S- on) 


i=1 i=mi4+1 
and q‘") oscillates and never converges. 


8. Notice that the method given in this problem is a form of deflation 
technique. Let 2 = 3°”, ca;. Then, 


gq) — 70) = (202) Ly = eee — (>: cat] Ly — em: 
i=l i=l 1=2 


In addition, 


g@t) = Ag” — (Aa)? 21) r4 
n v r+l1 
C2%2 + S- Cj ({) “| ; 
i=3 d2 
(r$1) 


Thus, lim;—.o6 (<>) = (2%, and q” goes in the direction x2 as r 
2 


n 
=> ; GN ts => pe 
i=2 


increases. 


Solutions from Chapter 6 
5. The Taylor polynomial of degree 2 for $f$, with error term, gives


\[ f(x_0 + h) = f(x_0) + f'(x_0)\,h + f''(x_0)\,\frac{h^2}{2} + f'''(\xi_1)\,\frac{h^3}{6}. \]


Similarly,


\[ f(x_0 + 2h) = f(x_0) + f'(x_0)\,(2h) + f''(x_0)\,2h^2 + f'''(\xi_2)\,\frac{(2h)^3}{6}. \]


Substituting these expansions into


\[ f'(x_0) \approx \frac{1}{h} \Bigl[ -\tfrac{3}{2} f(x_0) + c_1 f(x_0 + h) + c_2 f(x_0 + 2h) \Bigr] \]


and setting coefficients equal to zero except for the $h$ coefficient yields


\[ c_1 + c_2 = \tfrac{3}{2}, \qquad c_1 + 2c_2 = 1, \qquad \tfrac{c_1}{2} + 2c_2 = 0, \]


which can be solved to give $c_1 = 2$, $c_2 = -\tfrac{1}{2}$.
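
With $c_1 = 2$ and $c_2 = -1/2$ the one-sided formula has an error proportional to $h^2$, so halving $h$ should reduce the error by about a factor of four. A minimal check follows; the test function $\sin$ at $x_0 = 1$ and the step sizes are arbitrary choices for illustration.

```python
import math


def one_sided_derivative(f, x0, h):
    """(1/h) * [-3/2 f(x0) + 2 f(x0 + h) - 1/2 f(x0 + 2h)]."""
    return (-1.5 * f(x0) + 2.0 * f(x0 + h) - 0.5 * f(x0 + 2.0 * h)) / h


errors = [abs(one_sided_derivative(math.sin, 1.0, h) - math.cos(1.0))
          for h in (0.01, 0.005)]
ratio = errors[0] / errors[1]
print(errors, ratio)
```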
6. (a) Taylor series expansions give


\[ f(x_0 + h) = f(x_0) + f'(x_0)\,h + f''(x_0)\,\frac{h^2}{2} + f'''(x_0)\,\frac{h^3}{3!} + \cdots \]
and
\[ f(x_0 - h) = f(x_0) - f'(x_0)\,h + f''(x_0)\,\frac{h^2}{2} - f'''(x_0)\,\frac{h^3}{3!} + \cdots \]
Then,
\[ \frac{f(x_0 + h) - f(x_0 - h)}{2h} - f'(x_0) = \frac{f'''(x_0)}{3!}\,h^2 + \frac{f^{(5)}(x_0)}{5!}\,h^4 + \cdots \]
Thus,
\[ c_k = \frac{f^{(2k+1)}(x_0)}{(2k+1)!} \qquad \text{for } k = 1, 2, \ldots \]
(b) Let ay = —$ and a2 = 4. (Get this by setting the lower-order 


coefficients in the error expansion to zero.) 


18. (a) Letting f(z) =1 and f(x) =x yields aj = % and a; = §. 


(b) Using the transformation t = xh, we get 


h 1 
| g(t) tin(h/t) dt =| g(xh) (ah) In(1/z) hda. 
0 0 


From part (a), this becomes 
5 1 
h? | — ~g(h)). 
(F010 + 99 )) 


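19. Let $I = \int_a^b f(x)\,dx$. Then


\[
\begin{aligned}
I(h) &= I + c_1 h + c_2 h^2 + O(h^3), \\
2I\bigl(\tfrac{h}{2}\bigr) &= 2I + c_1 h + c_2 \tfrac{h^2}{2} + O(h^3), \\
3I\bigl(\tfrac{h}{3}\bigr) &= 3I + c_1 h + c_2 \tfrac{h^2}{3} + O(h^3).
\end{aligned}
\]


Manipulating these equations one can obtain


\[ \tfrac{9}{2}\,I\bigl(\tfrac{h}{3}\bigr) - 4\,I\bigl(\tfrac{h}{2}\bigr) + \tfrac{1}{2}\,I(h) = I + O(h^3). \]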
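
The weights $\tfrac{9}{2}$, $-4$, $\tfrac{1}{2}$ cancel both the $h$ and $h^2$ terms. The sketch below demonstrates this on a synthetic rule $I(h) = I + c_1 h + c_2 h^2 + c_3 h^3$ whose coefficients are made up for illustration; for this model the combination leaves exactly $c_3 h^3/6$.

```python
def model_rule(h, exact=1.0, c1=0.7, c2=-1.3, c3=2.0):
    """Synthetic I(h) = I + c1*h + c2*h^2 + c3*h^3 (hypothetical coefficients)."""
    return exact + c1 * h + c2 * h**2 + c3 * h**3


def extrapolate(rule, h):
    """Combine I(h/3), I(h/2), I(h) so the h and h^2 error terms cancel."""
    return 4.5 * rule(h / 3) - 4.0 * rule(h / 2) + 0.5 * rule(h)


h = 0.1
print(model_rule(h) - 1.0, extrapolate(model_rule, h) - 1.0)
```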




21. The formula will be exact for polynomials of degree less than or equal to 5. Hence, choosing $f(x) = 1, x, x^2, x^3, x^4, x^5$ yields the following equations:
3 3 
Ser = TT, yA =, 
al t=1 
3 a 3 
t=1 t=1 
3 Qn 3 
4=1 1=1 


Solving these equations gives 


1 
1 a| v3 v3 
—— dr x — a we 
| zSlowss [ir )+ 10405) 
The x;’s may be computed either with the technique in Remark 6.11 on 
page 347 or by observing that the Chebyshev polynomials are orthogonal 
with respect to the dot product 


9) = i. Fae ee: 


In particular, 


A eer 
2 2 
n-1n-1 
ff sew ) da dy — 2 Gs) 
n-1n-1 i+1)h p(j+l)h 
“Soh fh 

n-1m-1 pith pit) 

) OF Pas Si) . ) 49 S7 ci 


Using the Mean Value Theorem for integrals, this becomes: 


n-1n-1 ae a 
> oe fir &j) ho + hs) | 


1=0 7=0 
We then use the Intermediate Value Theorem to obtain 


h faring) , afin’) 
2 Ox Oy ‘ 


which gives the result. 


tt 24 
The integral if / z'y! dx dy = 0 when i and/or j is odd. For 
-1J/-1 


i = j = 0, the method is exact. For f(a,y) = 2?y?, we obtain = = 
4at > a= tq: 


For Monte Carlo calculations, 


prob (Ex F09 ay f(x)dax 


We require 


IA 
al” 
> 
eae 
» »- 

fay) 

| 

“Po 

Q 

8 


< 42 
~ vn 


ey pe er 
— e 2?dz>0. x 
/2r ie > 


Solving this equation by using the error function (such as using the MATLAB function erf or by looking it up in a table), we see that $\lambda = 3$ is sufficient. We thus obtain the inequality $\frac{3\sigma}{\sqrt{n}} < 0.001$ for this problem.


But


\[ \sigma^2 = \int_0^1 e^{2x}\,dx - \Bigl( \int_0^1 e^{x}\,dx \Bigr)^2 \in [0.24203, 0.24204]. \]


Hence, $\sqrt{n} > 3000\sigma$ or $n > 9 \times 10^6 \sigma^2$, and $9 \times 10^6 \sigma^2 < 2.18 \times 10^6$. Thus, for this problem, to reduce the error to less than 0.001 with probability 0.997, $n \ge 2.19 \times 10^6$ is sufficient.
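
The numbers in this estimate can be reproduced with the standard library. The sketch assumes the integrand is $e^x$ on $[0, 1]$, which is the choice whose variance lies in the quoted interval $[0.24203, 0.24204]$, and uses `math.erf` for the normal probability of falling within three standard deviations.

```python
import math

# Variance of e^x for x uniform on [0, 1] (assumed integrand).
sigma2 = (math.e**2 - 1.0) / 2.0 - (math.e - 1.0) ** 2

# P(|Z| <= 3) for a standard normal Z.
coverage = math.erf(3.0 / math.sqrt(2.0))

# Smallest n with 3*sigma/sqrt(n) < 0.001, i.e. n > 9e6 * sigma^2.
n_min = math.ceil(9.0e6 * sigma2)
print(sigma2, coverage, n_min)
```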


Solutions from Chapter 7 


3. 


Since f(x,y) does not satisfy a Lipschitz condition for x > 0, there is 
no guarantee that Euler’s method will converge to the solution. Indeed, 
for this problem, there are an infinite number of solutions. One solution 
is y(z) = 0 which Euler’s method appears to approximate. However 


=| 0, 0<2x<a, 
ac (2(z@-a))?,  x>a 


is a solution for any positive number a. 


10. 


. Clearly f(t, y) 
(ty 
& cos(2y) + ¢? is continuous in t and y. We need to 


show that f(t, y) sath: a Lipschitz condition in y. We have 


. t > 10... - 
f(é, 4) — F(t WI = |g (cos(2y) — cos(29))| < [2 sin(28)|ly — gl 
using the Mean Value Theorem. Thus, 


ft.) — FED < Shy aL. 


Therefore, there is a unique solution to this initial-value problem for 
0<t< 10. 


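Notice that


\[ y(t_{n+1}) = y(t_n) + h f(t_n, y(t_n)) + \frac{h^2}{2}\,y''(\xi_n). \]


Let $E_n = |y(t_n) - y_n|$. Then $E_{n+1} \le E_n + hLE_n + \delta + \tfrac{h^2}{2}M$. Using a telescoping argument, we get


\[
\begin{aligned}
E_{n+1} &\le \Bigl( \delta + \frac{h^2}{2}M \Bigr) \bigl[ 1 + (1 + hL) + (1 + hL)^2 + \cdots + (1 + hL)^n \bigr] \\
&= \Bigl( \delta + \frac{h^2}{2}M \Bigr) \frac{(1 + hL)^{n+1} - 1}{hL}.
\end{aligned}
\]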

This then implies 


he Vi A S35 
< ee ee 
By < (6+5m) AL 5 


— "| ah arte! ee | 


Let $z_1 = y$, $z_2 = y'$, $z_3 = y''$. Then we have $z_1' = z_2$, $z_2' = z_3$, $z_3' = t + 2tz_3 + 2t^2 z_1$ with $z_1(1) = 1$, $z_2(1) = 2$, $z_3(1) = 3$. Euler's method has the form


\[ z^{(i+1)} = z^{(i)} + h f(t_i, z^{(i)}), \qquad i = 0, 1, 2, \ldots \]


We have $h = 0.1$, $t_0 = 1$, $z^{(0)} = [1 \; 2 \; 3]^T$. Thus,


$z^{(1)} = z^{(0)} + 0.1 f(1, z^{(0)}) = [1.2 \; 2.3 \; 3.9]^T$ and
$z^{(2)} = z^{(1)} + 0.1 f(1.1, z^{(1)}) = [1.43 \; 2.69 \; 5.1584]^T$.
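
The two Euler steps can be replayed in a few lines. The sketch below takes the third equation to be $z_3' = t + 2tz_3 + 2t^2 z_1$, the form that reproduces the vectors quoted in this solution; the function names are illustrative.

```python
def f(t, z):
    """Right-hand side of the first-order system for (z1, z2, z3) = (y, y', y'')."""
    z1, z2, z3 = z
    return [z2, z3, t + 2.0 * t * z3 + 2.0 * t * t * z1]


def euler(f, t0, z0, h, steps):
    """Explicit Euler; returns the list of iterates z^(0), ..., z^(steps)."""
    t, z = t0, list(z0)
    history = [list(z)]
    for _ in range(steps):
        dz = f(t, z)
        z = [zi + h * di for zi, di in zip(z, dz)]
        t += h
        history.append(list(z))
    return history


history = euler(f, 1.0, [1.0, 2.0, 3.0], 0.1, 2)
for z in history:
    print([round(v, 4) for v in z])
```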
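\[
\begin{aligned}
|y_{i+1} - \hat{y}_{i+1}| &= \Bigl| y_i + \frac{h}{2}\bigl[ f(t_{i+1}, y_{i+1}) + f(t_i, y_i) \bigr] - \hat{y}_i - \frac{h}{2}\bigl[ f(t_{i+1}, \hat{y}_{i+1}) + f(t_i, \hat{y}_i) \bigr] \Bigr| \\
&\le |y_i - \hat{y}_i| + \frac{h}{2}\,|f(t_{i+1}, y_{i+1}) - f(t_{i+1}, \hat{y}_{i+1})| + \frac{h}{2}\,|f(t_i, y_i) - f(t_i, \hat{y}_i)| \\
&\le |y_i - \hat{y}_i| + \frac{h\lambda}{2}\,|y_{i+1} - \hat{y}_{i+1}| + \frac{h\lambda}{2}\,|y_i - \hat{y}_i|,
\end{aligned}
\]
so that
\[ |y_{i+1} - \hat{y}_{i+1}| \le \frac{1 + h\lambda/2}{1 - h\lambda/2}\,|y_i - \hat{y}_i| \le e^{3\lambda h/2}\,|y_i - \hat{y}_i| \]


for $\lambda h \le 1$. Continuing,
\[ |y_{i+1} - \hat{y}_{i+1}| \le \bigl( e^{3\lambda h/2} \bigr)^{i+1}\,|y_0 - \hat{y}_0| = e^{3\lambda t_{i+1}/2}\,|y_0 - \hat{y}_0|. \]
Hence,
\[ |y_i - \hat{y}_i| \le e^{3\lambda t_i/2}\,|y_0 - \hat{y}_0| \le e^{K}\,|y_0 - \hat{y}_0|, \qquad \text{where } K = 3\lambda T/2. \]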
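22. (a) Note that the exact solution satisfies


\[ y(t_{k+2}) + \alpha_1 y(t_{k+1}) + \alpha_0 y(t_k) = h\beta\,y'(t_{k+2}) + h\tau, \]


where $\tau$ is the local truncation error. Hence,


\[ h\tau = y(t_{k+2}) + \alpha_1 y(t_{k+1}) + \alpha_0 y(t_k) - h\beta\,y'(t_{k+2}). \]


Using the Taylor expansions of $y(t_{k+2})$, $y(t_{k+1})$, and $y'(t_{k+2})$ about $t = t_k$, we have


\[ h\tau = [1 + \alpha_1 + \alpha_0]\,y(t_k) + h\,[2 + \alpha_1 - \beta]\,y'(t_k) + h^2\,\bigl[ 2 + \tfrac{\alpha_1}{2} - 2\beta \bigr]\,y''(t_k) + O(h^3). \]
For the method to be second order, we need
\[ 1 + \alpha_1 + \alpha_0 = 0, \qquad 2 + \alpha_1 - \beta = 0, \qquad 2 + \tfrac{\alpha_1}{2} - 2\beta = 0, \]
which yields $\beta = \tfrac{2}{3}$, $\alpha_1 = -\tfrac{4}{3}$, $\alpha_0 = \tfrac{1}{3}$.
(b) For consistency note that $\alpha_0 + \alpha_1 + \alpha_2 = 0$ where $\alpha_2 = 1$ and also $\alpha_1 + 2\alpha_2 = \beta$. Hence the method is consistent.


(c) For stability consider $\rho(z) = \tfrac{1}{3} - \tfrac{4}{3}z + z^2 = 0$, which implies $z = \tfrac{1}{3}, 1$. Hence $|z_i| \le 1$ and the method is stable.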
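
The second-order claim can be tested on the linear problem $y' = \lambda y$, where the implicit equation can be solved in closed form at each step. The sketch below uses the coefficients $\beta = 2/3$, $\alpha_1 = -4/3$, $\alpha_0 = 1/3$; the test problem, the exact starting value for $y_1$, and the function name are illustrative choices.

```python
import math


def two_step_decay(lam, h, steps):
    """Apply y_{k+2} - 4/3 y_{k+1} + 1/3 y_k = (2/3) h lam y_{k+2} with y(0) = 1."""
    y_prev, y_curr = 1.0, math.exp(lam * h)      # exact starting values
    for _ in range(steps - 1):
        # Multiply the scheme by 3 and solve for y_{k+2}.
        y_next = (4.0 * y_curr - y_prev) / (3.0 - 2.0 * h * lam)
        y_prev, y_curr = y_curr, y_next
    return y_curr


errors = [abs(two_step_decay(-1.0, 1.0 / n, n) - math.exp(-1.0)) for n in (100, 200)]
ratio = errors[0] / errors[1]

# Roots of rho(z) = z^2 - (4/3) z + 1/3 from the quadratic formula.
disc = math.sqrt((4.0 / 3.0) ** 2 - 4.0 / 3.0)
roots = ((4.0 / 3.0 - disc) / 2.0, (4.0 / 3.0 + disc) / 2.0)
print(errors, ratio, roots)
```

An error ratio near 4 when $h$ is halved is what second-order convergence predicts, and the computed roots $\tfrac{1}{3}$ and $1$ match part (c).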
Solutions from Chapter 8 


3. There is a $\delta$ such that 


14. 


\[
\begin{aligned}
\|F(y) - F(x)\| &= \|F(y) - F(x) - F'(x)(y - x) + F'(x)(y - x)\| \\
&\le \|F(y) - F(x) - F'(x)(y - x)\| + \|F'(x)(y - x)\| \\
&\le \|y - x\| + \|F'(x)\|\,\|y - x\|
\end{aligned}
\]
for all $\|y - x\| < \delta$. (Existence of a $\delta$ such that


\[ \|F(y) - F(x) - F'(x)(y - x)\| \le \|y - x\| \]


for $\|y - x\| < \delta$ follows from the definition of the Fréchet derivative.)
Thus


\[ \|F(y) - F(x)\| \le (1 + \|F'(x)\|)\,\|y - x\| < \epsilon \]
when
\[ \|y - x\| < \min\Bigl\{ \delta, \; \frac{\epsilon}{1 + \|F'(x)\|} \Bigr\}. \]


Thus, $\|F(y) - F(x)\| < \epsilon$ when $\|x - y\| < \delta$. Hence, $F$ is continuous at $x$.


Consider $x^{(k+1)} = G(x^{(k)}) = Bx^{(k)} + b$. Then there exists a matrix norm $\|\cdot\|$ such that $\|B\| \le \rho(B) + \epsilon = \gamma < 1$. $G : \mathbb{R}^n \to \mathbb{R}^n$ and $G$ is a contraction on $\mathbb{R}^n$, since $\|G(x) - G(y)\| = \|B(x - y)\| \le \|B\|\,\|x - y\|$ for $x, y \in \mathbb{R}^n$. Therefore, by the Contraction Mapping Theorem, $G$ has a unique fixed point $x^* \in \mathbb{R}^n$ and $\{x^{(k)}\}_{k=0}^{\infty}$ converges to $x^*$. Now suppose that $\rho(B) \ge 1$. Let $\lambda$ be an eigenvalue of magnitude $\rho(B)$, i.e., $|\lambda| \ge 1$, and let $x$ be its associated eigenvector. Then $Bx = \lambda x$. Suppose that $x^{(0)} = x$ and $b = 0$. Then $x^{(k)} = \lambda^k x^{(0)}$ for $k = 0, 1, 2, \ldots$. This implies that $\rho(B) < 1$ is necessary for convergence for any $x^{(0)} \in \mathbb{R}^n$.
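
Both halves of the argument can be seen on $2 \times 2$ examples. In the sketch below the first matrix has spectral radius $0.5$ and the iteration converges to the fixed point of $x = Bx + b$; the second has an eigenvalue of magnitude 2, and starting from the corresponding eigenvector with $b = 0$ the iterates grow like $2^k$. Both matrices are hypothetical.

```python
def iterate(B, b, x0, steps):
    """Run x^(k+1) = B x^(k) + b for a fixed number of steps."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        x = [sum(B[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]
    return x


# rho(B) = 0.5 < 1: converges to the solution of (I - B) x = b, here (3, 2).
converged = iterate([[0.5, 0.25], [0.0, 0.5]], [1.0, 1.0], [0.0, 0.0], 100)

# rho(B) = 2 >= 1: the eigenvector (1, 0) is scaled by 2 at every step.
diverged = iterate([[2.0, 0.0], [0.0, 0.5]], [0.0, 0.0], [1.0, 0.0], 10)
print(converged, diverged)
```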


We have 
_ — jet —a +22, —-1 
F(a) =Vf(«) = E Lo ee ‘| and 
er+2 —|1 
V2 — = F’ . 
f= |" 5 |=PO 
Note that: 


(i) $F(x)$ is continuously differentiable on $\mathbb{R}^2$, since all elements of $F'(x)$ are continuous.
(ii) $F'(x)$ exists on $\mathbb{R}^2$ and is nonsingular on $\mathbb{R}^2$ because $\nabla^2 f(x)$ is strictly diagonally dominant on $\mathbb{R}^2$.

(iii) One may verify that $(F'(x))^{-1} \ge 0$ on $\mathbb{R}^2$ (for example, by computing the inverse symbolically and examining the resulting elements).

(iv) $\nabla f(x) = 0$ has a solution at $x_1 = x_2 = 0$.


(v) Note that 


But $e^b - e^a \ge e^a(b - a)$ for all $a, b \in \mathbb{R}$, so $F$ is convex. Therefore, the Newton iterates converge for any $x^{(0)} \in \mathbb{R}^2$.


Solutions from Chapter 9 
3. We have 


Thus, the Davidon—Fletcher—Powell method satisfies the quasi-Newton 
equation. 


Note that


\[ f(x_{k+1}) = f(x_k - t_k \nabla f(x_k)) = \varphi_k(t_k) = \min_t \varphi_k(t) = \min_t f(x_k - t\nabla f(x_k)) \le f(x_k). \]


If $\nabla f(x_k) = 0$, then $x_{k+1} = x_k$, so $f(x_{k+1}) = f(x_k)$. If $f(x_{k+1}) = f(x_k)$, this implies that $t = 0$ is a minimum of $\varphi_k(t)$. Therefore, $\varphi_k'(0) = 0$. But $\varphi_k'(0) = -(\nabla f(x_k))^T \nabla f(x_k)$ using the chain rule. Hence, $(\nabla f(x_k))^T \nabla f(x_k) = 0$, so $\nabla f(x_k) = 0$.
Solutions from Chapter 10 
1. (a) Using z = fy we have the following system of first order equations: 


dy 


Ys =Fley2), — yQ)=1 
dz 6y 
de ge? EY?) 2(2) =0 


Using Euler’s method, we get 
Yit1 = Yi thf (xi, yi, 2%), Zip. = 2 + hg( ari, yi, 2). 


With the given initial conditions, one can then calculate y; = 1, 
A= 3, y2= 3 = y(4). 

(b) Starting from z(2) = 3 and repeating the steps in part (a), we get 
mn =4,2=3,y=%=y(4). 

(c) Let the guess be G to obtain the desired result y(4) = 8. Using 
Lagrange interpolation, we get 


W0= 53 (5) +30 (3): 


Solving for the guess, we get G = i. 


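3. First note that


\[ \frac{y(x_{i+1}) - 2y(x_i) + y(x_{i-1})}{h^2} - p(x_i)\,\frac{y(x_{i+1}) - y(x_{i-1})}{2h} - q(x_i)\,y(x_i) = r(x_i) + \tau_i, \]


where
\[ \tau_i = \frac{h^2}{12}\,y^{(4)}(\xi_i) - p(x_i)\,\frac{h^2}{6}\,y^{(3)}(\eta_i) \]


for some $\xi_i, \eta_i \in [x_{i-1}, x_{i+1}]$. Subtracting the difference equation from this expression, letting $e_i = y(x_i) - u_i$, and rearranging the expression yields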


(3 + q(aj)h7)e; = (1 or hp(a@i))€s41 + + hp(ai))é + €j-1 
OG)+ py Plwidye (mi). 
Assuming h < 1/P* where 


= |p(z)| and Qs = min a(e z) >0, 


letting |e] = max; |e;|, and simplifying the resulting expression yields the 
result. 
10. We have
\[ f(t_i) = g(t_i) + \int_0^1 K(t_i, s)\,f(s)\,ds \]
and
\[ F_i = g(t_i) + h \sum_{j=0}^{N} w_j K(t_i, t_j)\,F_j. \]


Subtracting, we get


\[ f(t_i) - F_i = \int_0^1 K(t_i, s)\,f(s)\,ds - h \sum_{j=0}^{N} w_j K(t_i, t_j)\,f(t_j) + h \sum_{j=0}^{N} w_j K(t_i, t_j)\,\bigl( f(t_j) - F_j \bigr), \]
which implies


\[
\begin{aligned}
|f(t_i) - F_i| &\le ch^2 + h \sum_{j=0}^{N} w_j\,|K(t_i, t_j)|\,|f(t_j) - F_j| \\
&\le ch^2 + \frac{1}{2} \sum_{j=0}^{N} h w_j\,|f(t_j) - F_j| \\
&\le ch^2 + \frac{1}{2} \max_j |f(t_j) - F_j|.
\end{aligned}
\]


The result then follows by taking the maximum over all $i$ and combining the absolute value terms.
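
The bound $2ch^2$ can be observed on a small test problem. The sketch below is only an illustration under stated assumptions: it takes the hypothetical kernel $K(t,s) = ts/2$ (so $\max|K| = \tfrac{1}{2}$) with $g(t) = 5t/6$, for which $f(t) = t$ solves the equation, uses trapezoid weights ($w_0 = w_N = \tfrac{1}{2}$, otherwise $1$, so that $\sum_j h w_j = 1$), and solves the discrete system by fixed-point sweeps, which contract by a factor of at most $\tfrac{1}{2}$. For this kernel the quadrature error is at most $h^2/12$, so $c = 1/12$.

```python
def nystrom_error(N, sweeps=80):
    """Max nodal error of the discretized equation for the test problem above."""
    h = 1.0 / N
    t = [i * h for i in range(N + 1)]
    w = [0.5 if j in (0, N) else 1.0 for j in range(N + 1)]   # sum_j h*w_j = 1

    def K(ti, s):                      # hypothetical kernel, max |K| = 1/2
        return 0.5 * ti * s

    def g(ti):                         # chosen so that f(t) = t is the solution
        return 5.0 * ti / 6.0

    F = [g(ti) for ti in t]
    for _ in range(sweeps):            # fixed-point sweeps F <- g + h * sum(w K F)
        F = [g(ti) + h * sum(w[j] * K(ti, t[j]) * F[j] for j in range(N + 1))
             for ti in t]
    return max(abs(ti - Fi) for ti, Fi in zip(t, F))


for N in (10, 20):
    h = 1.0 / N
    print(N, nystrom_error(N), 2.0 * (1.0 / 12.0) * h * h)
```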
BFGS update, 496 
BFP, 535 
bifurcation point, 481 
big O notation, 5 
bisection 
method of, 36 
black box function, 58 
black box system, 459, 479 
block SOR method, 161 
Bolzano—Weierstrass Theorem, 194 
bound constraints, 487 
boundary value problem, 535 
boundary value problems 
finite difference methods, 540 
shooting method, 536 
bounded variation, 250 
bounding step, 523 
branch and bound algorithm, 74, 374, 
523 
clustering effect, 530 
for nonlinear systems of equations, 481 
branching step, 523 
Brouwer fixed point theorem, 471 
Broyden update, 474 
Broyden’s method, 475 
Broyden–Fletcher–Goldfarb–Shanno update, 496 
constrained optimization, 487 


Cauchy polygon method, 385 
Cauchy sequence, 40, 197 
Cauchy—Schwarz inequality, 90 
central difference formula, 324 
Central Limit Theorem, 369 
centroid 
of a simplex, 498 
chapeau functions, 547 
characteristic equation, 87 
characteristic polynomial, 291 
Chebyshev norm, 191 
Chebyshev polynomials, 217, 224 
Chebyshev’s Equi-Oscillation Theorem, 234 
Cholesky factorization, 117 
chop, 11 
clamped spline interpolant, 246 
clustering effect, branch and bound 
algorithm, 530 
code list, 329 
collocating basis, 212, 240 
companion matrix, 74, 170 
compatible matrix and vector norms, 
96 
complete search algorithms, 519 
complete space, 40 
complex Fourier series, 250 
complex reflector, 135 
composite integration, 334 
condition 
ill, 122 
number 
generalized, 180 
of a function, 16 
of a matrix, 123 
perfect, 124 
Wilkinson polynomial, 73 
consistency 
of a k-step method, 409 
of a method for solving an IVP, 
391 
consistently ordered matrix, 158 


constraint programming, 529 
constraint propagation, 527 
constraint satisfaction problem, 488 
continuation method, 481 
continuity 
modulus of, 230 
contraction, 40 
Contraction Mapping Theorem, 442 
in one variable, 40 
convergence 
global, 454 
iterative method for linear systems, 142 
linear, 7 
local, 454 
of a sequence of matrices, 102 
of a sequence of vectors, 92 
of the SOR method, 147 
order of, 7 
quadratic, 7 
rate 
asymptotic, 161 
semilocal, 454 
superlinear, 8 
convex 
function, 455 
program, 488, 515 
set, 444 
strictly, 475 
correct rounding, 23 
critical point 
of a constrained optimization problem, 502 
of an unconstrained optimization 
problem, 492 
cubic spline, 243 
cyclic matrix, 158 


Daubechies scaling function, 278 
Davidon—Fletcher—Powell update, 496 
defective matrix, 292 
deflation 
eigenvalues of matrices, 305 
roots of polynomials, 72 
dependency, interval, 27 
derivative tensor, 441 
Descartes’ rule of signs, 70 
DFP update, 496 
diagonally dominant, 150 

irreducibly, 150 

strictly, 111, 150 
differentiation 

automatic, 328 
dilates, 270 
dimension 

of a vector space, 89 
Dirichlet problem, 535 
discretization error, 387 
distance 

in a normed space, 92 
divided difference 

k-th order, 213 

first order, 63, 213 

Newton’s backward formula, 215 

Newton’s formula, 214 

second order, 63 
domain 

frequency, 258 

time, 258 
Doolittle algorithm, 185 
double QR algorithm, 312 
dual problem, 502 
dual variables, 502 
dynamic programming, 515 


eigenvalue, 86, 291 
simple, 297 
eigenvector, 86, 291 
elementary row operations 
for linear systems, 88 
in the simplex method, 508 
equi-oscillation property 
minimax, 234 
equilibration, row, 125 
equivalent norms, 93 
error 
absolute, 14 
backward analysis, 126 
bound for a single-step method for IVP, 393 
forward analysis, 126 
local truncation 
for a single-step method for 
IVP’s, 391 

method, 9 

relative, 14 

roundoff, 9 

roundout, 25 

truncation, 9 
Euclidean norm, 90 
Euler’s method, 385 
Euler—Maclaurin formula, 355 
Euler-Trapezoidal Method, 417 
excess width, 27 
Exchange Method of Remez, 235 
explicit method 

for solving IVP’s, 406 
explicit single-step method, 389 
extended precision, 389 
extended real numbers, 24 


Fast Fourier Transform, 256 
fathomed, 526 
feasible point, 507 
feasible set, 488 
feasible solution, 507 
FFT, 256 
Fibonacci sequence, 62 
filtering, 258 
finite difference methods, 540 
finite element method, 547 
fixed point, 39, 442 

iteration method, 39 
floating point numbers, 10 
forward difference formula, 215, 323 
forward error analysis, 126 
forward mode, automatic differentiation, 328 

Fourier series, 204 

complex, 250 
Fréchet derivative, 440 
Fredholm integral equation 

of the second kind, 554, 559 
existence and uniqueness, 560 
numerical solution of, 562 
frequency domain, 258 
Fritz John conditions, 503 
Frobenius norm, 96 
full pivoting 
Gaussian elimination, 114 
full rank matrix, 86 
Fundamental theorem of algebra, 69 
fundamental theorem of interval arithmetic, 26 


GA’s, 520 
Galerkin method, 545, 549 
for Fredholm integral equations 
of the second kind, 563 
functional analysis setting, 549 
Gauss—Legendre quadrature, 348 
Gauss-Seidel method, 143 
Gauss—Seidel-Newton method, 461 
different from Newton–Gauss–Seidel Method, 462 
Gaussian elimination, 105 
full pivoting, 114 
partial pivoting, 114 
Gaussian quadrature, 343 
2-point, 344 
definition, 344 
error term, 349 
Gauss—Legendre rules, 348 
generalized condition number, 180 
Generalized Rolle’s Theorem, 216 
genetic algorithms, 520 
Gerschgorin’s Circle Theorem, 293 
for Hermitian matrices, 296 
Givens rotation, 137, 316 
global convergence, 454 
global minimizer, 489 
global optimization, 489, 518 
global truncation error, 387 
golden mean, 491 
golden section search, 490 
Gram matrix, 198 
Gram-Schmidt process, 138, 201 
modified, 139, 171 
graph of a matrix, 157 


Haar basis, 274 

Halton sequence, 371 

harmonic analysis, 258 

hat functions, 239, 547 

Hermite interpolating polynomial, 220 

Hermite interpolation, 220 

Hermite polynomials, 224 

Hermitian matrix, 87, 295 

Hessenberg matrix, 307 

Hessian matrix, 492 

heuristic, 497 

Hilbert matrix, 125 

Hilbert space, 197 

homotopy method, 480 
predictor-corrector method, 481 

Horner’s method, 71 

Householder transformation, 133, 305 

HUGE, 20 


identity matrix, 86 
IEEE arithmetic, 19 
ill-conditioned, 122 
implicit method 

for solving IVP’s, 406 
implicit Simpson’s method, 405 
implicit trapezoid method, 434 
improper integrals, 365 
improved Euler method, 397 
independent, linearly, 89 
infeasible linear program, 513 
infimum, 194 
infinite integrals, 365 
initial value problem, 381 
inner product, 91 

real space, 195 
integration, 333 

composite, 334 

infinite, 365 

midpoint rule, 341 

Monte Carlo, 369 

multiple, 364 

singular, 365 
interior point methods, 507 
interpolating polynomial 
Hermite, 220 
Lagrange form, 210, 212, 338 
Newton form, 214 
interpolation 
hermite, 220 
interval arithmetic 
fundamental theorem of, 26 
operational definitions, 24 
interval dependency, 27 
interval extension 
first order, 28 
mean value, 33 
multivariate mean value, 467 
second order, 28 
interval Gauss-Seidel method 
nonlinear, 465 
interval Newton 
operator, 464 
univariate, 54 
interval Newton method 
multivariate, 463 
quadratic convergence of, 57 
univariate, 55 
inverse 
of a matrix, 86 
inverse midpoint matrix, 154 
inverse power method, 303 
invertible matrix, 86 
irreducible, 150 
irreducible graph, 157 
irreducibly diagonally dominant, 150 
iterative refinement, 127 
IVP, 381 


Jacobi diagonalization, 315 
Jacobi method, 143 
for computing eigenvalues, 315 
nonlinear, 461 
Jacobi rotation, 316 
Jacobi-Newton Method, 461 
Jacobian matrix, 440 


Kantorovich Theorem, 454 
Kantorovich theorem, 51 
Karmarkar’s algorithm, 507 
kernel 

of an integral equation, 554 
Krawczyk 

operator, 471 
Krawczyk method 

multivariate, 470 

univariate, 81 
Kronecker delta function, 92 
Krylov subspace, 169 
Kuhn—Tucker equations, 501 
Kuhn-Tucker point, 502 


L-2 norm, 192 
Lagrange 
basis, 210, 212 


polynomial interpolation, 210, 212, 324, 338 
Laguerre polynomials, 224 
Lanczos Algorithm, 172 
Lax—Milgram lemma, 550 
least squares 
approximation, 140, 198, 279 
general least squares problem, 222 
norm, 192 
left singular vector, 175 
Legendre polynomials, 203, 223 
Leibnitz rule, 264 
Lemaréchal’s technique, 532 
line search, 490 
golden section, 490 
linear convergence, 7 
linear least squares problem, 279 
linear program, 487, 488 
infeasible, 513 
standard form, 504 
unbounded, 513 
linear programming 
interior point methods, 507 
linear relaxation, 527 
linearly independent, 89 
Lipschitz condition, 40, 382, 555 
Lipschitz matrix, 463 
local convergence, 454 
local minimizer, 489
local truncation error
finite difference methods for boundary value problems, 542
of a method for solving IVP’s, 391
low discrepancy sequence, 371 
LU 
decomposition, 109 
factorization, 109 


machine constants, 20 
machine epsilon, 20 
mag, 156 
magnitude (of an interval), 156 
mantissa, 10 
matrix norm, 95 
compatible, 96 
Frobenius, 96 
induced, 97 
natural, 97 
max norm, 191 
mean value interval extension, 33 
multivariate, 467 
mean value theorem 
for integrals, 2 
multivariate, 441 
univariate, 3 
method error, 9 
method of bisection, 36 
midpoint method 
for solution of initial value problems, 390
midpoint rule 
for quadrature, 341, 342, 349 
mig, 156 
mignitude (of an interval), 156 
minimax approximation, 230 
minimax equi-oscillation property, 234 
Miranda’s Theorem, 468 
mixed boundary conditions, 544 
modified Euler method, 397 
modified Gram-Schmidt procedure, 139
modified Gram-Schmidt process, 171
modulus of continuity, 230
monic polynomial, 218
Monte Carlo integration, 369 
Monte Carlo method 
quasi, 371 
Moore-Penrose pseudo-inverse, 176
Muller’s method, 63 
multi-stage decision processes, 515 
multiple integrals, 364 
multistep method, 405 
multivariate interval Newton operator, 464
multivariate mean value theorem, 441 


NaN, 20 
natural or induced matrix norm, 97 
natural spline, 246 
near minimax approximation, 235 
Nelder-Mead simplex method, 497
Newton’s backward difference formula, 215
Newton’s divided difference formula, 214
Newton’s forward difference formula, 215
Newton’s method
convergence of, 50
multivariate, 447
local convergence of, 450
univariate, 49
Newton-Cotes formulas, 336
closed, 336 
open, 336 
Newton-Gauss-Seidel Method, 460
different from Gauss-Seidel-Newton Method, 462
Newton-Kantorovich Theorem, 454
Newton-SOR iteration, 1-step, 460
node
of a graph, 157
nondefective matrix, 292
nonlinear interval Gauss-Seidel method, 465
nonlinear Jacobi method, 461
nonlinear program, 487
nonsingular matrix, 86 
nonsmooth optimization, 532 
norm 
L^2, 192
Chebyshev, 191 
equivalent, 93 
Euclidean, 90 
important ones on C^n, 90
least squares, 192 
matrix, 95 
compatible, 96 
Frobenius, 96 
induced, 97 
natural, 97 
max, 191 
of a vector space, 88 
uniform, 191 
normal distribution 
standard, 214 
normal equations, 140, 198 
normed vector space, 88 
not a number, 20 
NP-complete problem, 488 


objective function, 487 
operator overloading, 333 
optimization 
constrained, 487 
convex, 488 
unconstrained, 487 
optimizer, 488 
optimizing point, 488 
optimum, 488 
order 
of a single-step method for solving an IVP, 391
of convergence, 7 
order of accuracy 
multistep method, 407 
predictor-corrector method, 417 
origin shift, 308 
orthogonal complement, 199 
orthogonal decomposition, 132 
orthogonal polynomials, 222, 345
orthogonal projection, 200
Ostrowski, a lemma of, 449
outward rounding, 25
overflow, 20
overloading, operator, 333
overrelaxation factor, 145 


Padé approximant, 260 
Padé methods
for IVP’s, 424
partial pivoting
Gaussian elimination, 114
pattern search algorithms, 497
perfectly conditioned, 124
perpendicular (from a point to a set), 200
Perron-Frobenius theorem, 150
plane rotation, 137, 316
Poincaré’s inequality, 553
polynomial time algorithm, 488
positive
definite, 87
semi-definite, 87
preconditioning, 154
predictor-corrector method
for a homotopy method, 481
for systems of ordinary differential equations, 416
primal variables, 502 
product formula, 364 
projection operator, 199 
property A, 158 
pseudo-inverse, 176 


QR 
decomposition, 132 
factorization, 132 
method, 307 
convergence of, 309 
double, 312 
with origin shifts, 308 
quadratic convergence, 7 
quadratic programming, 514 
quadrature, 333 
composite, 334
Gaussian, 343
2-point, 344
definition, 344
error term, 349
Gauss-Legendre rules, 348
midpoint rule, 342
midpoint rule, 341, 349
Newton-Cotes, 336
product formula, 364 
rectangle rule, 364 
Simpson’s rule, 342 
symmetric rule, 363 
trapezoidal rule, 342 
quasi-Monte Carlo method, 371 
quasi-Newton
Davidon-Fletcher-Powell update, 496
equation, 473
method, 473
quasi-random sequence, 371


R-stage Runge-Kutta method, 397 
rank
of a matrix, 86
rational approximation, 259 
Rayleigh quotient, 300 
real inner product space, 195 
rectangle rule, 364 
reducible, 150 
reflector
complex, 135
relative error, 14
relaxation, 524
linear, 527
relaxation direction, 164 
Remez Algorithm, 235 
residual vector, 127, 163 
Richardson extrapolation, 354 
right singular vector, 175 
Rolle’s Theorem, 216
Generalized, 216
Romberg integration, 355
round, 11
down, 19
to nearest, 19
to zero, 19
up, 19
rounding modes, 19
roundoff error, 9
in Gaussian elimination, 126
roundout error, 25
row equilibration, 125
Runge’s function, 219, 284
Runge-Kutta method, 396
fourth order classic, 398
R-stage, 397
stability of, 398


scaling function
Daubechies, 278
Schur decomposition, 101, 292 
Schwarz inequality, 90 
search tree, 526 
secant equation, 473 
secant method, 59
convergence of, 60
secant update, 479 
semi-definite, 87 
semilocal convergence, 454 
sequence
Cauchy, 197
Sherman-Morrison formula, 476
shooting method, 536 
significant digits, 17 
similarity transformation, 292 
simple eigenvalue, 297 
simplex method
of linear programming, 507
of Nelder and Mead, 497
Simpson’s method
for IVP, 405
Simpson’s rule, 342
simultaneous iteration, 318
single use expression, 27
single-step methods, 389
singular integrals, 365
singular vector
left, 175
right, 175
slope matrix, 464
smoothing polynomials, 281
smoothness, 373 
solution set, 131 
SOR 
matrix, 146 
method, 145 
SOR method 
block, 161 
convergence of, 147 
span, 89 
sparse matrix, 157 
spectral radius, 87, 291 
spectrum, 291 
spline 
B-, 243 
clamped, 246 
cubic, 243 
natural, 246 
stability
A(0), 425
predictor-corrector method, 418
Runge-Kutta methods, 398
standard form, linear program, 504
standard normal distribution, 214
steepest descent method, 493
Steffensen’s method, 68
Stein’s Theorem, 147
Stein-Rosenberg theorem, 149
stiff
system of ODE’s, 419, 422
strictly convex, 475
strongly connected directed graph, 157
subdistributivity, 25
subspace
Krylov, 169
of a vector space, 89
subspace iteration, 318 
successive overrelaxation, 145 
successive relaxation method, 143 
SUE, 27 
superlinear convergence, 8 
symmetric matrix, 87 
symmetric quadrature rule, 363 
synthetic division, 71 


tape, 329 
Taylor polynomial 
approximation by, 209 
multivariate, 442 
Taylor series methods 
for solving IVP’s, 395 
Taylor’s theorem, 2 
tensor 
derivative, 441 
time domain, 258 
TINY, 20 
total step method, 143 
tractable problems, 488 
trapezoid method 
implicit, 434 
trapezoidal rule, 340, 342 
triangle inequality, 89, 90 
for 2-norm, 91 
when strict inequality, 485 
triangular 
decomposition, 109 
factorization, 109 
truncation error, 9 
global, 387 
Tschebycheff polynomials, 217 
two-cyclic matrix, 158 
two-point compactification, 24 


unbounded linear program, 513 
unconstrained optimization, 487 
underflow, 20 
underrelaxation factor, 145 
uniform norm, 191 
unitary matrix, 124 
update
quasi-Newton, 495


Vandermonde matrix, 211, 338 
variation, bounded, 250 
vector space 
normed, 88 
Volterra integral equation 
of the second kind, 554 
numerical solution of, 557 


wavelet, 273
weak formulation, 544
Weierstrass approximation theorem, 205
Wilkinson polynomial, 73, 82
Wilkinson Prize, 82